ECONOMETRICS: AN INTRODUCTION TO MAXIMUM LIKELIHOOD METHODS
S. Valavanis
Published on demand by UNIVERSITY MICROFILMS
University Microfilms Limited, High Wycombe, England
A Xerox Company, Ann Arbor, Michigan, U.S.A.
This is an authorized facsimile of the original book, and was produced in 1974 by microfilm xerography by Xerox University Microfilms, Ann Arbor, Michigan, U.S.A.
ECONOMETRICS
An Introduction to Maximum Likelihood Methods
STEFAN VALAVANIS
Assistant Professor of Economics
Harvard University, 1956 to 1958
EDITED, FROM MANUSCRIPT, BY
ALFRED H. CONRAD
Assistant Professor of Economics
Harvard University
1959
New York Toronto London
McGRAW-HILL BOOK COMPANY, INC.
ECONOMICS HANDBOOK SERIES
SEYMOUR E. HARRIS, Editor
Advisory Committee: Edward H. Chamberlin, Gottfried Haberler, Alvin H. Hansen, Edward S. Mason, and John H. Williams. All of Harvard University.
Burns • Social Security and Public Policy
Duesenberry • Business Cycles and Economic Growth
Hansen • The American Economy
Hansen • A Guide to Keynes
Hansen • Monetary Theory and Fiscal Policy
Harris • International and Interregional Economics
Henderson and Quandt • Microeconomic Theory
Hoover • The Location of Economic Activity
Kindleberger • Economic Development
Lerner • Economics of Employment
Valavanis • Econometrics
ECONOMETRICS
ECONOMETRICS. Copyright © 1959 by the McGraw-Hill Book Company, Inc. Printed in the United States of America. All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers. Library of Congress Catalog Card Number 58-14363
THE MAPLE PRESS COMPANY, YORK, PA.
Editor's introduction
For years many teachers of economics and other professional
economists have felt the need of a series of books on economic subjects
which is not filled by the usual textbook, nor by the highly technical
treatise.
This present series, published under the general title of Economics
Handbook Series, was planned with these needs in mind. Designed
first of all for students, the volumes are useful in the ever-growing field
of adult education and also are of interest to the informed general
reader.
The volumes present a distillate of accepted theory and practice,
without the detailed approach of the technical treatise. Each volume
is a unit, standing on its own.
The authors are scholars, each writing on an economic subject on
which he is an authority. In this series the author's first task was not to make important contributions to knowledge — although many of them do — but so to present his subject matter that his work as a scholar will carry its maximum influence outside as well as inside the classroom. The time has come to redress the balance between the energies spent on the creation of new ideas and on their dissemination. Economic ideas are unproductive if they do not spread beyond the world of scholars. Popularizers without technical competence, unqualified textbook writers, and sometimes even charlatans control too large a part of the market for economic ideas.
In the classroom the Economics Handbook Series will serve, it is hoped, as brief surveys in one-semester courses, as supplementary reading in introductory courses, and in other courses in which the subject is related.
Seymour E. Harris
Editor's preface
The editor welcomes Stefan Valavanis' study of econometrics into
the Economics Handbook Series as a unique contribution to econometrics and to the teaching of the subject.
Anyone who reads this book will understand the tragedy of the
death of Stefan Valavanis. He was brilliant, imaginative, and a first-class scholar and teacher, and his death is a great loss to the world of
ideas.
Professor Valavanis had virtually completed his book just before his departure for Europe in the summer of 1958. But, as is always true of a manuscript left with the publisher, though it was essentially complete much remained to be done. My colleague, Professor Alfred H. Conrad, volunteered to finish the job. Unselfishly he put the final touches on the book, went over the manuscript, checked the mathematics, assumed the responsibility for seeing it through the press,
and helped in many other ways. Without his help, the problem of
publication would have been a serious one. The publisher and editor
are indeed grateful.
This book is an introduction to econometrics, that is, to the techniques by which economic theories are brought into contact with the
facts. While not in any sense a "cookbook," its orientation is
constantly toward the strategy of economic research. Within the
field of econometrics, the book is primarily addressed to the problems
of estimation rather than to the testing of hypotheses. It is concerned
with estimating, from the insufficient information available, the values
or magnitudes of the variables and relationships suggested by economic
analysis. The maximum likelihood and limited information techniques are developed from fundamental assumptions and criteria and
demonstrated by example; their costs in accuracy and computation
are weighed. There are short but careful treatments of identification,
instrumental variables, factor analysis, and hypothesis testing. The
book proceeds much more by statements of problems and examples than
by the development of mathematical proofs.
The main feature of this book is its pedagogical strength. While
rigor is not sacrificed and no mathematical or statistical rabbits are
pulled out of the author's hat, the statistical tools are always presented
in terms of the fundamental limitations and criteria of the real world.
Almost every concept is introduced by an example set in this world of
real problems and difficulties. Mathematical concepts and notational
distinctions are most often handled in clearly set off "digressions."
The fundamental notions of probability and matrix algebra are
reviewed, but in general it is assumed that the student has
already been introduced to determinants and matrices and the
elementary properties and processes of differentiation. (No more
knowledge of mathematics is required than for any of the other
comparable texts, and, thanks to the pedagogical skills of the author,
probably considerably less.) Frequent emphasis is placed upon computation design and requirements.
Valavanis' book is brilliantly organized for classroom presentation,
most of the statistical and mathematical assumptions and concepts
being treated verbally and by example before they appear in any
mathematical formulation. In addition to the examples used in
presentation, there are exercises in almost every chapter.
Seymour E. Harris
Preface
This work is neither a complete nor a systematic treatment of
econometrics. It definitely is not empirical. It has one unifying idea:
to reduce to commonsense terms the mathematical statistics on which
the theory of econometrics rests.
If anything in econometrics (or in any other field) makes sense, one ought to be able to put it into words. The result may not be so compact as a close-knit mathematical exposition, but it can be, in its own way, just as elegant and clear.
Putting symbols and jargon into words understandable to a wider
audience is not the only thing I want to do. I think that watering
down a highly refined or a very deep mathematical argument is a
useful activity. For instance, if the essence of a problem can be
captured by two variables, why tackle n? Or why worry about
mathematical continuity, existence, and singularity in a discussion of
economic matters, unless these intriguing properties have interesting
economic counterparts? We would be misspending effort if all the
reader wants is an intelligent layman's idea of what is going on in the
field of econometrics. For the sake of the punctilious, I shall give warning every time my heuristic "proof" is not watertight or whenever I slur over an unessential mathematical maze.
Much of econometric literature suffers from overfancy notation. If
I judge rightly, many people quake at the sight of every new issue of
Econometrica. I hope to show them the intuitive good sense that hides
behind the mathematical squiggles.
Besides restoring the self-assurance of the ordinary intelligent reader and helping him discriminate between really important developments in econometric method and mere mathematical quibbles, I have tried to be useful to the teachers of econometrics and peripheral subjects by supplying them with material in "pedagogic" form. And lastly, I should like to amuse and surprise the serious or expert econometrician, the connoisseur, by serving him familiar hash in nice new palatable ways but without loss of nutrient substance.
The gaps in this work are intentional. One cannot, from this book
alone, learn econometrics from the ground up; one must pick up
elementary statistical notions, algebra, a little calculus — even some
econometrics — elsewhere.
For the beginner in econometrics, an approximately correct sequence
would be the books of Beach (1957), Tinbergen (1951), Klein (1953),
and Hood (1953); with Tintner (1952) as a source of examples or
museum for the numerous varieties of quantitative techniques in
existence. Tinbergen emphasizes economic policy; Klein, the business cycle and macroeconomics; Tintner, the testing of hypotheses
and the analysis of time series. All three use interesting empirical
examples. For elementary mathematics the first part of Beach
(perhaps also Klein, appendix on matrices) is enough. Reference to
all these easily available and digestible texts is meant to avoid my
repeating what has been said by others.
From time to time, however, I make certain "digressions"; these
are held in from the margins. These digressions have to do mostly
with mathematical and statistical subjects that in my opinion are
either inaccessible or not well explained elsewhere.
Stefan Valavanis
Acknowledgments
Harvard University, for grants from the Joseph H. Clark Bequest and
from the Ford Foundation's Small Research Fund in the Department
of Economics.
Professor and Mrs. William Jaffé
Professor Arthur S. Goldberger
Professor Richard E. Quandt
Professor Frederick F. Stephan
Contents
Editor's introduction v
Editor's preface vii
Preface ix
Digressions xv
Frequent references xvii
Chapter 1. The fundamental proposition of econometrics 1
1.1. What econometrics is about . 1
1.2. Mathematical tools 2
1.3. Outline of procedure and main discoveries in the next hundred pages 3
1.4. All-importance of statistical assumptions 4
1.5. Rationalization of the error term 5
1.6. The fundamental proposition 6
1.7. Population and sample 7
1.8. Parameters and estimates 8
1.9. Assumptions about the error term 9
1. u is a random real variable 9
2. ut, for every t, has zero expected value 10
3. The variance of u_t is constant over time 13
4. The error term is normally distributed 14
5. The random terms of different time periods are independent . . 16
6. The error is not correlated with any predetermined variable . . 16
1.10. Mathematical restatement of the Six Simplifying Assumptions . . 17
1.11. Interpretation of additivity 18
1.12. Recapitulation 18
Further readings 21
Chapter 2. Estimating criteria and the method of least squares . 23
2.1. Outline of the chapter 23
2.2. Probability and likelihood 24
2.3. The concept of likelihood function 28
2.4. The form of the likelihood function 29
2.5. Justification of the least squares technique 31
2.6. Generalized least squares 33
2.7. The meaning of unbiasedness 35
2.8. Variance of the estimate 39
2.9. Estimates of the variance of the estimate 41
2.10. Estimates ad nauseam 43
2.11. The meaning of consistency 43
2.12. The merits of unbiasedness and consistency 45
2.13. Other estimating criteria 47
2.14. Least squares and the criteria 47
2.15. Treatment of heteroskedasticity 48
Further readings 50
Chapter 3. Bias in models of decay 52
3.1. Introduction and summary 52
3.2. Violation of Simplifying Assumption 6 53
3.3. Conjugate samples 54
3.4. Source of bias 57
3.5. Extent of the bias 58
3.6. The nature of initial conditions 58
3.7. Unbiased estimation 60
Further readings 62
Chapter 4. Pitfalls of simultaneous interdependence 63
4.1. Simultaneous interdependence 63
4.2. Exogenous variables 64
4.3. Haavelmo's proposition. 64
4.4. Simultaneous estimation 67
4.5. Generalization of the results 70
4.6. Bias in the secular consumption function 71
Further readings 71
Chapter 5. Manyequation linear models 73
5.1. Outline of the chapter 73
5.2. Effortsaving notation 74
5.3. The Six Simplifying Assumptions generalized 77
5.4. Stochastic independence 78
5.5. Interdependence of the estimates 82
5.6. Recursive models 83
Further readings 84
Chapter 6. Identification 85
6.1. Introduction 85
6.2. Completeness and nonsingularity 86
6.3. The reduced form 87
6.4. Over- and underdeterminacy 88
6.5. Bogus structural equations. 89
6.6. Three definitions of exact identification 90
6.7. A priori constraints on the parameters 91
6.8. Constraints on parameter estimates . 94
6.9. Constraints on the stochastic assumptions 95
6.10. Identifiable parameters in an underidentified equation 96
6.11. Source of ambiguity in overidentified models 98
6.12. Identification and the parameter space 100
6.13. Over and underidentification contrasted . 102
6.14. Confluence 103
Further readings 106
Chapter 7. Instrumental variables 107
7.1. Terminology and results 107
7.2. The rationale of estimating parametric relationships 108
7.3. A single instrumental variable 110
7.4. Connection with the reduced form 113
7.5. Properties of the instrumental variable technique in the simplest case 114
7.6. Extensions 115
7.7. How to select instrumental variables 116
Chapter 8. Limited information 118
8.1. Introduction 118
8.2. The chain of causation 119
8.3. The rationale of limited information . . . 121
8.4. Formulas for limited information 123
8.5. Connection with the instrumental variable method. . . . . . 125
8.6. Connection with indirect least squares 125
Further readings 125
Chapter 9. The family of simultaneous estimating techniques . . 126
9.1. Introduction . 126
9.2. Theil's method of dual reduced forms 126
9.3. Treatment of models that are not exactly identified 128
9.4. The "natural state" of an econometric model 130
9.5. What are good forecasts? 132
Further readings. 134
Chapter 10. Searching for hypotheses and testing them 136
10.1. Introduction 136
10.2. Discontinuous hypotheses 137
10.3. The null hypothesis 138
10.4. Examples of rival hypotheses 138
10.5. Linear confluence 142
10.6. Partial correlation 144
10.7. Standardized variables 145
10.8. Bunch map analysis 146
10.9. Testing for linearity 150
10.10. Linear versus ratio models 152
10.11. Split sectors versus sector variable 153
10.12. How hypotheses are chosen 154
Further readings 155
Chapter 11. Unspecified factors 156
11.1. Reasons for unspecified factor analysis 156
11.2. A single unspecified variable 157
11.3. Several unspecified variables 159
11.4. Linear orthogonal factor analysis 160
11.5. Testing orthogonality 163
11.6. Factor analysis and variance analysis 164
Further readings 165
Chapter 12. Time series 166
12.1. Introduction 166
12.2. The time interval 167
12.3. Treatment of serial correlation 168
12.4. Linear systems 171
12.5. Fluctuations in trendless time series 172
12.6. Correlograms and kindred charts 174
12.7. Seasonal variation 176
12.8. Removing the trend. 178
12.9. How not to analyze time series 181
12.10. Several variables and time series 182
12.11. Time series generated by structural models 183
12.12. The overall autoregression of the economy 185
12.13. Leading indicators 186
12.14. The diffusion index 188
12.15. Abuse of long-term series 190
12.16. Abuse of coverage 191
12.17. Disagreements between cross-section and time series estimates 192
Further readings 195
Appendix A. Layout of computations 197
The rules in detail 198
Matrix inversion . 201
B. Stepwise least squares 203
C. Subsample variances as estimators 207
D. Proof of least squares bias in models of decay .... 209
E. Completeness and stochastic independence 213
F. The asterisk notation. 214
Index 217
Digressions
On the distinction between probability and probability density . . . 9
On the distinction between "population" and "universe" .... 11
On infinite variances 14
On the univariate normal distribution 14
On the differences among moment, expectation, and covariance . . . 13
On the multivariate normal distribution 26
On computational arrangement 32
On matrices of moments and their determinants 34
On notation 44
On arbitrary weights 50
On directional least squares 69
On Jacobians . 79
On the etymology of the term "multicollinearity" 105
On correlation and kindred concepts . 141
On moving averages and sums 188
Frequent references
Names in small capital letters refer to the following works:
Allen R. G. D. Allen, Mathematical Economics. New York: St. Martin's Press, Inc., 1956. xvi, 768 pp., illus.
Beach Earl F. Beach, Economic Models: An Exposition. New York: John Wiley & Sons, Inc., 1957. xi, 227 pp., illus.
Hood William C. Hood and Tjalling C. Koopmans (eds.), Studies in
Econometric Method, Cowles Commission Monograph 14. New York:
John Wiley & Sons, Inc., 1953. xix, 323 pp., illus.
Kendall Maurice G. Kendall, The Advanced Theory of Statistics, vols. I, II. London: Charles Griffin & Co., Ltd., 1943; 5th ed., 1952. Vol. I, 457 pp., illus.; vol. II, vii, 521 pp., illus.
Klein Lawrence R. Klein, A Textbook of Econometrics. Evanston: Row, Peterson & Company, 1953. ix, 355 pp., illus.
Koopmans Tjalling C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph 10. New York: John Wiley & Sons, Inc., 1950. xiv, 438 pp., illus.
Tinbergen Jan Tinbergen, Econometrics, translated from the Dutch by H. Rijken van Olst. New York: McGraw-Hill Book Company, Inc., Blakiston Division, 1951. xii, 258 pp., illus.
Tintner Gerhard Tintner, Econometrics. New York: John Wiley & Sons, Inc., 1952. xiii, 370 pp.
CHAPTER 1
The fundamental proposition
of econometrics
1.1. What econometrics is about
An econometrician's job is to express economic theories in mathe
matical terms in order to verify them by statistical methods, and to
measure the impact of one economic variable on another so as to be
able to predict future events or advise what economic policy should be followed when such and such a result is desired.
This definition describes the major divisions of econometrics, namely,
specification, estimation, verification, and prediction.
Specification has to do with expressing an economic theory in mathe
matical terms. This activity is also called model building. A model
is a set of mathematical relations (usually equations) expressing an
economic theory. Successful model building requires an artist's touch,
a sense of what to leave out if the set is to be kept manageable, elegant, and useful with the raw materials (collected data) that are available.
This book deals only incidentally with the "specification" aspect of
econometrics.
The problem of estimation is to use shrewdly our all too scanty data,
so as to fill the formal equations that make up the model with numerical
values that are good and trustworthy. Suppose we have the following
simple theory to quantify: Consumption today (C) depends on yester
day's income (Z) in such a way that equal increments of income, no
matter what income level you start from, always bring equal increments
in consumption. Letting α stand for consumption at zero income and γ for the marginal propensity to consume, this theory can be expressed thus:

C_t = α + γZ_t        (1-1)
The problem of estimation (and the main concern of this book) is to discover how to use whatever experience we have about consumption C and income Z in order to make a shrewd guess about how large α and γ might really be. The problem of estimation is to guess correctly α and γ, the parameters (or inherent characteristics) of the consumption function.
Point estimation is making the best possible single guess about α and about γ. Interval estimation is guessing how far our guess of α may be from the true α, and our guess of γ from the true γ.
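Anticipating the least-squares machinery of Chapter 2, point estimation can be sketched in a few lines of modern code. Everything here is my own illustration, not the author's: the income and consumption figures are invented, and the ASCII names alpha_hat and gamma_hat stand in for the badged Greek estimates.

```python
# A minimal sketch (invented data) of point estimation for the model
# C_t = alpha + gamma * Z_t, using the ordinary least-squares formulas
# that Chapter 2 justifies.

Z = [100.0, 110.0, 120.0, 130.0, 140.0]   # yesterday's income (made up)
C = [82.0, 88.0, 97.0, 102.0, 111.0]      # today's consumption (made up)

n = len(Z)
z_bar = sum(Z) / n
c_bar = sum(C) / n

# Slope estimate: cross-moments about the means over squared Z deviations.
gamma_hat = sum((z - z_bar) * (c - c_bar) for z, c in zip(Z, C)) / \
            sum((z - z_bar) ** 2 for z in Z)
# Intercept estimate: the fitted line passes through the sample means.
alpha_hat = c_bar - gamma_hat * z_bar

print(alpha_hat, gamma_hat)
```

With these made-up figures the fitted line is C = 9.6 + 0.72 Z, so 0.72 would be the estimated marginal propensity to consume.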
It is not enough, of course, to be able to make correct point and
interval estimates. We want to make them as cheaply as possible.
We must know about efficient programming of computations, checks of accuracy, and short cuts. Though this aspect of estimation will not
occupy us very much, I shall give some computational advice from
time to time.
Verification sets up criteria of success and uses these criteria to
accept or reject the economic theory we are testing with the model and
our data. It is a tricky subject deeply rooted in the mathematical
theory of statistics.
Prediction involves rearranging the model into convenient shape, so
that we can feed it information about new developments in exogenous
and lagged variables and grind out answers about the impact of these
variables on the endogenous variables.
1.2. Mathematical tools
In explaining how to fashion good estimates for the parameters of an econometric model I shall often step into the mathematical statistician's toolroom to bring out one gadget or another required by the next step of our procedure. These digressions are clearly marked so they can be skipped by those acquainted with the tool in question.
The mathematical tools used again and again are elementary: analytic geometry, which makes equations and graphs interchangeable; probability, a concept enabling us to make precise statements about uncertain events; the derivative (or the operation of differentiating), which is a help in making a "best" guess among all possible guesses; moments, which are a sophisticated way of averaging various magnitudes; and matrices, which are nothing but many-dimensional ordinary numbers — indeed, statements that are true of ordinary numbers seldom fail for matrices — for instance, you can add, subtract, multiply, and divide matrices analogously to numbers and, in general, handle them as if they were ordinary numbers though perhaps more fragile; a vector is a special kind of matrix.
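The remark that matrices can be handled like ordinary numbers, "though perhaps more fragile," can be made concrete with a small sketch. This is my own illustration, not the author's; the two matrices are arbitrary.

```python
# Treating matrices as "many-dimensional numbers": add, subtract, multiply,
# and "divide" (multiply by an inverse), with one fragility on display.
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 4.0], [2.0, 5.0]])

S = A + B                   # addition, element by element
D = A - B                   # subtraction
P = A @ B                   # matrix multiplication
Q = B @ np.linalg.inv(A)    # "division" of B by A: B times A-inverse

# Fragility: unlike ordinary numbers, products depend on the order of
# the factors, so A @ B and B @ A generally differ.
print(np.allclose(A @ B, B @ A))
```

Note that Q @ A recovers B, which is what "dividing B by A and multiplying back" ought to mean; the division step fails entirely when A is singular, another fragility ordinary nonzero numbers never show.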
1.3. Outline of procedure and main discoveries in the
next hundred pages
I. We shall deal first with models consisting of a single equation.
We shall find that even in this simple case there are important
difficulties.
A. It is not always possible to estimate the parameters of even a
singleequation model, for two sorts of reason:
1. We may lack enough data. This is called the problem of degrees of freedom.
2. Though the data are plentiful, they may not be rich or varied enough. This is the problem of multicollinearity.
B. Our second important finding will be that "pedestrian"
methods of estimation, for example the least squares fit, are
apt to be treacherous. They either give us erroneous impressions about the true values of the parameters or waste the data.
II. Turning then to models containing two or more equations, our
main findings will be the following:
A. It is sometimes impossible to determine the value of each
parameter in each equation, but this time not merely for lack
of data or their monotony, but rather because the equations
look too much like one another to be disentangled. Econometricians call this undesirable property lack of identifiability.
B. Nonpedestrian, statistically sophisticated methods become very
complex and costly to compute when the model increases from
a single equation even to two.
C. Happily, however, by sacrificing some of the rigor of these ideal "equestrian" methods in special, shrewd ways, we can cut the burden of computation by a factor of 5 or 10 and still get pretty good results. Such techniques are called limited information techniques, because they deliberately disregard
theoretical econometricians work in this field, because the need
is very great to know not only how to boil down complexity
with clever tricks but also precisely how much each trick costs
us in accuracy.
1.4. All-importance of statistical assumptions
The key word in estimation is the word stochastic. Its opposite is exact or systematic.
Stochastic comes from the Greek stokhos (a target, or bull's-eye). The outcome of throwing darts is a stochastic process, that is to say, fraught with occasional misses. In economics, indeed in all empirical disciplines, we do not expect our predictions to hit the bull's-eye 100 per cent of the time.
Econometrics begins by saying a great deal more about this matter of missing the mark. Where ordinary economic theory merely recognizes that we miss the mark now and then, econometrics makes statistical assumptions. These are precise statements about the particular way the darts hit the target's rim or hit the wall. Everything — estimation, prediction, and verification — depends vitally on the content of the statistical assumptions. Econometric models emphasize this fact by using a special variable u, called the error term. The error
term varies from instance to instance, just as one dart falls above,
another below, one to the left, another to the right, of the target. A
subscript t serves to indicate the various values of the error term. To
make model (11) stochastic, we write
C t = a + yZ t + u t (12)
Before going on to rationalize the presence of the error term u in equation (1-2), two things must be explained. First, u could have been included as a multiplicative factor or as an exponential rather than as an additive term. Second, its subscript t need not express time. It
can refer just as well to various countries or income classes.
To facilitate the exposition, I shall henceforth treat u_t as an additive term and take t to represent time. Exceptions will be clearly labeled.
The commonsense interpretation of additivity is deferred to Sec. 1.11.
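The dart-throwing picture can be made concrete with a tiny simulation, which is my own sketch rather than anything in the text. The "true" α, γ, the error variance, and the income series are all invented; only the simulator knows the true parameters, just as only nature knows them in practice.

```python
# A hypothetical simulation of the stochastic model (1-2):
# C_t = alpha + gamma * Z_t + u_t, where the additive error term u_t
# makes each period's consumption miss the exact line.
import random

random.seed(7)
alpha, gamma = 10.0, 0.7                   # "true" parameters (invented)
Z = [100.0 + 5.0 * t for t in range(8)]    # invented income series

C = []
for z in Z:
    u_t = random.gauss(0.0, 2.0)           # one "dart throw": a random miss
    C.append(alpha + gamma * z + u_t)

# Each C_t lies near, but not exactly on, the line alpha + gamma * Z_t;
# the gaps below are the simulated errors themselves.
misses = [c - (alpha + gamma * z) for c, z in zip(C, Z)]
print(misses)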
1.5. Rationalization of the error term
There are four types of reasons why an econometric model should be
stochastic and not exact: incomplete theory, imperfect specification,
aggregation of data, and errors of measurement. Not all of them
apply to every model.
1. Incomplete theory
A theory is necessarily incomplete, an abstraction that cannot
explain everything. For instance, in our simple theory of consumption:
a. We have left out possible variables, like wealth and liquid assets,
that also affect consumption.
b. We have left out equations. The economy is much more complex
than a single equation, no matter how many explanatory variables this
single equation may contain; there may be other links between consumption and income besides the consumption function.¹
c. Human behavior is "ultimately" random.
2. Imperfect specification
We have linearized a possibly nonlinear relationship.
3. Aggregation of data
We have aggregated over dissimilar individuals. Even if each of
them possessed his own α and γ and if his consumption reacted in
exact (nonstochastic) fashion to his past income, total consumption
would not be likely to react exactly in response to a given total income,
because its distribution may change. Another way of putting this is:
¹ How many independent links there may be and how we are to find them is itself a problem in statistical inference, and is treated briefly in Chap. 10.
Variables expressing individual peculiarities are missing (cf. 1a). Or this way: Equations that describe income distribution are missing (cf. 1b).
4. Errors of measurement
Even if behavior were exact, survey methods are not, and our
statistical series for consumption and income contain some errors of
measurement. Throughout this book we pretend that all variables are
measured without error.
1.6. The fundamental proposition
All we get out of an econometric model is already implied in:
1. Its specification; that is to say, consumption C depends on yesterday's income Z as in the equation C_t = α + γZ_t + u_t and in no other way¹
2. Our assumptions concerning u, that is to say, the particular way we suppose the relationship between C and Z to be inexact²
3. Our sampling procedure, namely, the way we arrange to get data
4. Our sample, i.e., the particular data that happen to turn up after
we decide how to look for them
5. Our estimating criterion, i.e., what properties we desire our
estimates of a and 7 to have, short of the unattainable: absolute
correctness
Over items 1, 2, 3, and 5 we have absolute control; for we are free to
change our theory of consumption, our set of assumptions concerning
the error term, our datacollecting techniques, and our estimating
criterion. We have no control over item 4; for what data actually
turn up is a matter of luck.
According to this fundamental proposition, what estimates we get for the parameters (α and γ) depends, among other things, on the stochastic assumptions, i.e., what we choose to suppose about the behavior of the error term u. Every set of assumptions about the error term prescribes a certain way of guessing at the true value of the parameters. And conversely, every guess about the parameters is implicitly a set of stochastic assumptions.
¹ This is also called "structural specification."
² This is also called "stochastic specification."
The relationship between stochastic assumptions and parameter
estimates is not always a onetoone relationship. A given set of
stochastic assumptions is compatible with several sets of different
parameter estimates, and conversely. In practice, we don't have to
worry about these possibilities, because we shall be making assumptions about u that lead to unique guesses about α and γ, or, at the very worst, to a few different guesses. Also, in practice, since we usually are interested in α and γ, not in verifying our assumptions
about u, it does not matter that many different u assumptions are
compatible with a single set of parameter estimates.
1.7. Population and sample
The whole of statistics rests on the distinction between population
and observation or sample. People were receiving income and consuming it long before econometrics and statistics were dreamed of.
There is, so to speak, an underlying population of C's and Z's, which
we can enumerate, hypothetically, as follows:
C_1, C_2, . . . , C_p, . . . , C_P        or        C_p for p = 1, 2, . . . , P
Z_1, Z_2, . . . , Z_p, . . . , Z_P        or        Z_p for p = 1, 2, . . . , P
Of these C's and Z's we may have observed all or some. Those that we have observed take on a different index s instead of p, to emphasize that they form a subset of the population. All we observe, then, is C_s and Z_s, where the s assumes some (perhaps all) of the values that
p runs, but no more. Index s can start running anywhere, say, at
p = 5, assume the values 6 and 7, skip 8, 25, and 92, and stop anywhere
short of the value P or at P itself. In all cases that I shall discuss,
the sample covers consecutive time periods, which are renumbered, for
convenience, in such a way that the beginning of time coincides with
the beginning of the sample, not of the population. Whether the
sample is consecutive or not sometimes does and sometimes does not
affect the estimation of a and 7.
Note that we mean by the term sample a given collection of observations, like S = (C_9, C_10, C_11; Z_9, Z_10, Z_11), not an isolated observation. S is a sample of three observations, the following ones: (C_9, Z_9), (C_10, Z_10), and (C_11, Z_11). Samples made up of a single observation can exist, of course, but we seldom work with so little observation.
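As a sketch (with invented figures, not from the text), the population/sample distinction amounts to picking out a subset of the population's indices:

```python
# Hypothetical illustration of Sec. 1.7: the population runs p = 1, ..., P;
# a sample consists of the observations at some of those index values.
P = 12
population_C = {p: 80.0 + 2.0 * p for p in range(1, P + 1)}   # invented C_p
population_Z = {p: 100.0 + 3.0 * p for p in range(1, P + 1)}  # invented Z_p

# A sample of three consecutive observations,
# S = (C_9, C_10, C_11; Z_9, Z_10, Z_11):
sample_indices = [9, 10, 11]
sample = [(population_C[p], population_Z[p]) for p in sample_indices]

print(sample)   # three (C, Z) pairs, a subset of the population
```

Renumbering the sample so that "time begins" at its first observation, as the text describes, would simply relabel indices 9, 10, 11 as 1, 2, 3.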
1.8. Parameters and estimates
Another crucial distinction is between a parameter and its estimate.
If the theory is correct, there are, hiding somewhere, the true a and 7.
These we never observe. What we do do is guess at them, basing
ourselves on such evidence and common sense as we may have. The
guesses, or estimates, always wear a badge to distinguish them from
the parameters themselves.
CONVENTION
We shall use three kinds of badge for a parameter estimate: a roof-shaped hat, as in α̂, γ̂, to mark maximum likelihood estimates; a bird,
the same symbol upside down, as in α̌, γ̌, for naive least squares estimates; and the wiggle, as in α̃, γ̃, for other kinds of estimates or for
estimates whose kind we do not wish to specify. These types of
estimate are defined in Chap. 2.
The distinction between error and residual is analogous to the
distinction between parameter and estimate. The error u_t is never
observed, although we may speculate about its behavior. It always
goes with the real α and γ, as in (1-2), whereas the residual, which is an
estimate of the error and whose symbol always wears a distinctive
badge, can be calculated, provided we have settled on a particular
guess (α̂, γ̂) or (α̃, γ̃) for the parameters. The value of the error does
not depend on our guessing; it is just there, in the population and,
therefore, in the sample. The residual, however, depends on the
particular guess. To emphasize this fact we put the same badge on
the residual as on the corresponding parameter estimate. We write,
for example, C_t = α̂ + γ̂Z_t + û_t or C_t = α̃ + γ̃Z_t + ũ_t.
Now we can state precisely what the problem of estimation is, as
follows. We assume a theory, for example, the theory of consumption
C_t = α + γZ_t + u_t; we assume that u_t behaves in some particular
way (to which the next section is devoted); we get a set of observations
on C and Z (the sample). Then we manipulate the sample data to
give us estimates α̃ and γ̃ that satisfy our estimating criterion (discussed in Chap. 2). Then we compute if we wish the residuals ũ_t as
estimates of the errors u_t.
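The steps just listed can be run through in miniature. The data are invented, and naive least squares (defined in Chap. 2) stands in here for whatever estimating criterion one adopts:

```python
import random

random.seed(1)

# Invented data satisfying C_t = alpha + gamma * Z_t + u_t.
ALPHA_TRUE, GAMMA_TRUE = 10.0, 0.8
Z = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0]
C = [ALPHA_TRUE + GAMMA_TRUE * z + random.gauss(0, 1) for z in Z]

# One possible estimating criterion: naive least squares (see Chap. 2).
n = len(Z)
z_bar, c_bar = sum(Z) / n, sum(C) / n
gamma_est = sum((z - z_bar) * (c - c_bar) for z, c in zip(Z, C)) / sum(
    (z - z_bar) ** 2 for z in Z
)
alpha_est = c_bar - gamma_est * z_bar

# Residuals: computable estimates of the unobservable errors u_t.
residuals = [c - (alpha_est + gamma_est * z) for z, c in zip(Z, C)]
```

The true α and γ never appear in the estimation step; only the sample and the criterion do.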
1.9. Assumptions about the error term
Besides additivity (Sec. 1.4) we shall now make and interpret six
assumptions about the error term. Of these, the first is indispensable,
exactly as it stands. The other five could be different in content or in
number. Note carefully that these are statements about the u's, not
the û's.
Assumption 1. u is a random real variable.
If the model is stochastic, either its systematic variables (consumption or income) are measured with errors, or the consumption function
itself is subject to random disturbances, or both. Since we have ruled
out (Sec. 1.5) errors of measurement, the relationship itself has to be
stochastic.
" Random" is one of those words whose meaning everybody knows
but few can define. Unpardonably, few standard texts glv© its
definition. A variable is random if it takes on a number of different
values, each with a certain probability. Its different values can be
infinite in number and can range all over the field, provided th@r<§ are
at least two of them. For instance, a variable w that is equal to
— y% twentyfive per cent of the time, to 3 + \/2 forty per cent of the
time, and to +35.3 thirtyfive per cent of the time is a random variable.
We may or may not know what values it takes on or their probabili
ties. Its probability distribution may or may not be expressible
analytically. (See Sec. 2.2.)
Digression on the distinction between probability
and probability density
A random variable w can be discrete, like the number of remaining teeth in members of a group of people, or continuous, like their
weight. If w takes on a finite number of values, their probabilities can be quite simply plotted on and read off a dot diagram or
point graph (Fig. 1a).
With a continuous variable we can usually speak only of the
probability that its value should lie between (or at) such and
such limits. In this case, we plot a probability-density graph
(Fig. 1b). The height of such a graph at a point is the probability
density, and the relative area under the graph between any two
points of the w axis is the probability.
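The distinction can be made concrete: for a continuous variable the height of the density is not itself a probability; only an area under the graph is. A sketch, with an assumed standard normal w:

```python
import math

# Standard normal density: an assumed example of a continuous distribution.
def density(w):
    return math.exp(-w * w / 2) / math.sqrt(2 * math.pi)

# The height at a point is a probability DENSITY, not a probability.
height_at_0 = density(0.0)  # about 0.399, and no probability exceeds 1 anyway

# The probability that w lies between two limits is the AREA under the
# graph between them, approximated here by a simple midpoint Riemann sum.
def prob_between(a, b, steps=100_000):
    dw = (b - a) / steps
    return sum(density(a + (i + 0.5) * dw) for i in range(steps)) * dw

p = prob_between(-1.0, 1.0)  # about 0.683
```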
Assumption 2. u_t, for every t, has zero expected value.
Naively interpreted, this proposition says that the "average" value
of u_1 is zero, that of u_2 is also zero, and so forth. Or, to put it differently, it says that a prediction like C_1 = α + γZ_1 is "on the
average" correct, that the same is true of C_2 = α + γZ_2, and so forth.
Fig. 1. A random variable. a. Variable w is discrete. The illustration is a dot
diagram, or point graph. b. Variable w is continuous. The illustration is a
probability-density graph.
The trouble begins when you begin to wonder what "on the average"
can possibly mean if you stick to a single instance, like time period 1
(or time period 2). For every event happens in a particular way and
not otherwise. Suppose, for instance, that in the year 1776 (t = 1776)
consumption fell short of its theoretical level α + γZ_1776 by 2.24
million dollars, that is, that u_1776 = −2.24. Obviously, then, the
average value of u_1776 is exactly −2.24. What could we possibly wish
to convey by the statement that u_1776 (and every other u_t) has zero
expected value?
One should never identify the concept of expected value with the
concept of the arithmetic mean. Arithmetic mean denotes the sum of
a set of N numbers divided by N and is an algebraic concept. Expected
value is a statistical or combinatorial concept. You have to imagine an
urn containing a (finite or infinite) number of balls, each with a number
written on it. Consider now all the possible ways one could draw one
ball from such an urn. The arithmetic mean of the numbers that
would turn up if we exhausted all possible ways of drawing one ball is
the expected value.
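The urn picture can be written out directly; the numbers on the balls here are invented:

```python
from fractions import Fraction

# An invented urn: each ball is one equally likely way of drawing once.
urn = [-3, -1, 0, 1, 3]  # hypothetical numbers written on the balls

# The expected value is the arithmetic mean over ALL possible one-ball
# draws, not the mean of any particular run of actual draws.
expected_value = sum(Fraction(b) for b in urn) / len(urn)
```

Any single draw from this urn is almost never 0; it is the exhaustive average over all conceivable draws that equals 0.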
The random term of an econometric model is assumed to come from
an Urn of Nature which, at every moment of time, contains balls with
numbers that add up to zero.
The commonsense interpretation of Nature's Urn is as follows:
Though in 1776 actual consumption in fact fell short of the theoretical
by 2.24 and no other amount, the many causes that interacted to
produce u_1776 = −2.24 could have interacted (in 1776) in various other
ways. This, theoretically, they were free to do since they were
random causes. Now, try to think of all conceivable combinations of
these causes — or if you prefer, think of very many 1776 worlds,
identical in all respects except in the combinations of random causes
that generated the random term. Let us have as many such worlds
as there are theoretical combinations of the causes behind the random
term. In some worlds the causes act overwhelmingly to make consumption lower than the nonstochastic level α + γZ_1776; in other
worlds the causes act so as to make it greater than α + γZ_1776; and in a
few worlds the causes cancel out, so that C_1776 = α + γZ_1776 exactly.
Now consider the random terms of all possible worlds, and (says the
assumption) they will average out to zero.
This interpretation is a conceptual model we can never hope to
prove or disprove. Its chief merit is that it reduces chance and
statistics to the (relatively) easy language and theorems of com
binatorial algebra. Some people take it seriously; others (myself
included) use it for lack of anything better.
Digression on the distinction between "population" and "universe"
Whether or not we take Nature's Urn seriously, we will be well
advised to acknowledge that we are dealing with three levels of
discourse, not just the two that I called population and sample.
The third and deeper level is called the universe. It contains all
events as they have happened and as they might have happened if
everything else had remained the same but the random shocks.
Level I Sample: things that both happened and were observed.
It is drawn from
Level II Population : things that happened but were not neces
sarily observed. It is drawn from
Level III Universe: all things that could have happened. (In
the nature of things only a few did.)
CONVENTION
We shall henceforth use four types of running subscript:
s = 1, 2, . . . , S        for the sample
p = 1, 2, . . . , P        for the population
i = 1, 2, . . . , I        for the universe
t = 1, 2, . . . , T        for instances in general, whether they come from the
                          sample, the population, or the universe
In a sense the population (C_p, Z_p) of consumption and income
as they actually happened in recorded and unrecorded history is
merely a sample from the universe (C_i, Z_i) of all possible worlds.
Naturally, what we call the sample is drawn from the population
of actual events, not from the hypothetical universe of level III.
In most instances it does no harm to speak (and prove theorems)
as if level I were picked directly from level III, not from level II.
The Platonic universe of level III is indeed rather unseemly for
the field of statistics (which is surely, in lay opinion, the most
hardboiled of mathematics, resting solidly on "facts") and has
been amusingly ridiculed by Hogben.¹
The next few paragraphs state why the abstract model of
Nature's Urn is a less appropriate foundation for econometrics
than for statistical physics or biology. But the rest of the book
goes merrily on using the said Urn.
Economic and physical phenomena alike take place in time.
¹ Lancelot Hogben, F. R. S., Statistical Theory: The Relationship of Probability,
Credibility, and Error; An Examination of the Contemporary Crisis in Statistical
Theory from a Behaviourist Viewpoint, pp. 98–105 (London: George Allen & Unwin
Ltd., 1957).
In both fields, the statement that u t is a random variable for each
t is inevitably an abstraction, because time runs on inexorably.
In the physical sciences events are deemed "repeatable," or aris
ing from a universe "fixed in repeated samples," primarily
because the experimenter can ideally replicate exactly all system
atically significant conditions that had surrounded his original
event. This is not possible in social phenomena of the "irreversible" or "progressive" type. Although in the physical
sciences it may be safe to neglect the difference between popula
tion and universe, it is unsafe in econometrics. For, as economic
phenomena take place in time, all other conditions, including
the exogenous variables, move on to new levels, often never to
return. The commonsense phrase "on the average over similar
experiments" makes much more sense in a laboratory science
than in economics.
Nature's Urn also supports maximum likelihood, variance of an
estimate, bias, consistency, and many other notions we shall have
occasion to introduce in later chapters. All these rest on the
notion of "all conceivable samples." The class of all conceivable
samples includes first of all samples of all conceivable sizes; it also
includes all conceivable samples of a given size, say, 4. A sample
of size 4 may consist of points that actually happened (if so, they
are in the population); it also could consist (partly or entirely)
of points in the universe but not in the population. The latter
kind of sample is easy to conceive but impossible to draw, because
the imagined points never "happened." Therefore, even a com
plete census of what happened is not enough for constructing an
exhaustive list of all conceivable samples.
Assumption 3. The variance of u t is constant over time.
This means merely that, in each year, u t is drawn from an identical
Urn, or universe. This assumption states that the causes underlying
the random term remain unchanged in number, relative importance,
and absolute impact, although, in any particular year, one or another
of them may fail to operate.
For simplicity's sake we have assumed no errors of measurement.
In fact there may be some, and their typical size could vary system
atically with time (or with the independent variable Z). If we try
to measure the diameter of a distant star, our error of measurement is
likely to be several million miles; when we measure the diameter of
Sputnik, it can be only a few feet. Likewise, if our data stretch from
1850 to 1950, national income increases by a factor of 20. It is quite
likely that errors of measurement, too, should increase absolutely.
If they do, Assumption 3 is violated, and some of the techniques that
I develop below should not be used.
Digression on infinite variances
The variance of u is not only constant but finite. When u is
normal it is unnecessary to stipulate that its variance is finite,
because all nondegenerate normal distributions have finite vari
ance. There exist, however, nondegenerate distributions with
zero mean and infinite variance, for example, the discrete random
variable
. . . , −16, −8, −4, 4, 8, 16, . . .
with probabilities, respectively,
. . . , 1/16, 1/8, 1/4, 1/4, 1/8, 1/16, . . .
The central limit theorem, according to which the sum of N
random variables (distributed in any form) approaches the
normal distribution for large N, is valid only if the original distributions have finite variances.
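A quick check of the digression's discrete example, taking the values ±4, ±8, ±16, . . . with probabilities 1/4, 1/8, 1/16, . . . (the pattern as reconstructed here): the probabilities sum to 1, the mean is zero by symmetry, and the variance diverges as more terms are included. The computation is only illustrative:

```python
from fractions import Fraction

# Discrete variable taking the values -(2**k) and +(2**k), each with
# probability 1 / 2**k, for k = 2, 3, ...
def terms(k_max):
    for k in range(2, k_max + 1):
        p = Fraction(1, 2 ** k)
        yield -(2 ** k), p
        yield 2 ** k, p

total_prob = sum(p for _, p in terms(50))   # approaches 1 as k_max grows
mean = sum(v * p for v, p in terms(50))     # exactly 0, by symmetry

# Each extra k adds 2**(k + 1) to the sum of v*v*p, so the variance,
# which is the limit of these partial sums, is infinite.
partial_variance = [sum(v * v * p for v, p in terms(k)) for k in (5, 10, 20)]
```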
Assumption 4. The error term is normally distributed.
This is a rather strong restriction. We impose it mainly because
normal distributions are easy to work with.
Digression on the univariate normal distribution
The single variable normal distribution is shaped like a symmetrical two-dimensional bell whose mouth is wider than the
mouth of anything you might name. Normal distributions come
tall, medium, and squat (i.e., with small, medium, or large variances). And the top of the bell can be over any point of the
w axis; that is to say, the mean of the normal can be negative,
zero, or positive, large or small. This distribution's chief charac
teristic is that extreme values are more and more unlikely the
more extreme they get.
For instance, the likelihood that all the people christened
John and living in London will die today is extremely small, and
the likelihood that none of them will die today is equally small.
Now why is this? Because London is not under atomic attack,
the Johns are not all aboard a single bus, not all of them are diving
from the London Bridge, nor were they all born 85 years ago.
Each goes about his business more or less independently of the
others (except, perhaps, father-and-son teams of Johns), some old,
some young, some exposing themselves to danger and others not.
The reason why the probability that w of these Johns will die
today approximates the normal is that there are very many of
them and that each is subjected to a vast number of independent
influences, like age, food, heredity, job, and so forth. This
probability would not be normal if the Johns were really few,
if the causes working toward their deaths were few, or if such
causes were many but linked with one another.
The assumption that u is normal is justified if we can show that the
variables left out of equation (1-2) are infinitely numerous and not
interlinked. If they are merely very many and not interlinked, then
u is approximately normal. If they are infinitely many but enough
of them are interlinked, then u is not even approximately normal.
We often know or suspect that these variables, such as wealth, liquid
balances, age, residence, and so forth, are quite interlinked and are
very likely to be present together or absent together.
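The "many non-interlinked absent variables" argument can be imitated numerically. Below, u is first built as the sum of many small independent causes, and then, for contrast, of causes that always appear or vanish together; the whole setup is invented for illustration:

```python
import random
import statistics

random.seed(7)

N_WORLDS, N_CAUSES = 4_000, 100

# u as the sum of many small, mutually independent causes: by the
# central limit theorem the sum is approximately normal.
independent_u = [
    sum(random.choice((-1, 1)) for _ in range(N_CAUSES)) for _ in range(N_WORLDS)
]

# u when the causes are completely interlinked (all present or all absent
# together): the sum remains a crude two-point variable, nothing like normal.
interlinked_u = [N_CAUSES * random.choice((-1, 1)) for _ in range(N_WORLDS)]

# Roughly 38 per cent of a normal variable's mass lies within half a
# standard deviation of its mean; the two-point variable has none there.
def share_within_half_sd(xs):
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(1 for x in xs if abs(x - m) <= s / 2) / len(xs)

share_normal = share_within_half_sd(independent_u)       # roughly 0.38
share_interlinked = share_within_half_sd(interlinked_u)  # 0.0
```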
Sometimes the following argument is advanced: In our model of
consumption, the error term u stems from many sources; that is, we
have left out variables, we have left out equations, we have linearized,
we have aggregated, and so on. These are all different operations,
presumably not linked with one another. Therefore, u is normally
distributed.
This argument is, of course, a bad heuristic argument, and it does
not even stand for an existing (but difficult) rigorous argument. It
is logically untidy to count as arguments for the normality of u, on one
and the same level of discourse, such diverse items as the fact of
linearization and the number of unspecified variables that affect
consumption.
The assumption stands or falls on the argument of many non
interlinked absent variables. Most alternative assumptions cause
great computational grief.
Assumption 5. The random terms of different time periods
are independent.
This assumption requires that in each period the causes that deter
mine the random term act independently of their behavior in all
previous and subsequent periods. It is easy to violate this assumption.
1. The error term includes variables that act cyclically. If, for
example, we think consumption has a bulge every 3 years because that
is how often we get substantially remodeled cars, this effect should be
introduced as a separate variable and not included in u.
2. The model is subject to cobweb phenomena. Suppose that
consumers in year 1 (for any reason) underestimate their income, so
that they consume less than the theoretical amount. Then in year 2
they discover the error and make it up by consuming more than the
theoretical amount of year 2; and so on.
3. One of the causes behind the random term may be an employee's
vacation, which is usually in force for 2 weeks though the model's
unit period is 1 week. Any such behavior violates the requirement
that the error of any period be independent of the error in all previous
periods,
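The cobweb pattern of example 2 is easy to simulate: an error that partly "makes up" for last period's error shows a markedly negative first-order autocorrelation, violating Assumption 5. The coefficient and sample size are invented:

```python
import random

random.seed(3)

T = 5_000
shock = [random.gauss(0, 1) for _ in range(T)]

# A cobweb-like error: each period's u partly reverses last period's
# (the -0.8 is an invented illustrative coefficient).
u = [shock[0]]
for t in range(1, T):
    u.append(-0.8 * u[t - 1] + shock[t])

# First-order autocorrelation of u: near -0.8 here, far from the zero
# that Assumption 5 requires.
mean_u = sum(u) / T
num = sum((u[t] - mean_u) * (u[t - 1] - mean_u) for t in range(1, T))
den = sum((x - mean_u) ** 2 for x in u)
rho = num / den
```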
Assumption 6. The error is not correlated with any
predetermined variable.
To appreciate this assumption, suppose that (for whatever reason)
sellers set today's price p t on the basis of the change in the quantity
sold yesterday over the day before; that is,
p_t = α + γ(q_{t-1} − q_{t-2}) + u_t
Suppose, further, that the greater (and more evident) the change in q
the more they strive to set a price according to the above rule. Such
behavior violates Assumption 6.
We can think of examples where behavior is fairly exact (small u's)
for moderate values of the independent variable, but quite erratic for
very large or very small values of the independent variable. I am
apt, for instance, to stumble more if there is either too little or too
much light in the room. This, again, violates Assumption 6, because
the error in the stochastic equation describing my motion depends on
the intensity of light.
So we come to the end of our statistical assumptions about the error
u. When in future discussion I speak of u as having "all the Simplify
ing Properties" (or as "satisfying all the Simplifying Assumptions"),
I mean exactly these six.
Certain of these six assumptions can be checked or statistically
verified from a sample; others cannot. I shall return to this topic
later.
Of these assumptions only Assumption 1 is obligatory. There are
decent estimating procedures for other sets of assumptions.
1.10. Mathematical restatement of the Six Simplifying
Assumptions
1. u_t is random for every t: Some p(u) is defined for all u such
that
0 ≤ p ≤ 1        and        ∫ p(u) du = 1
2. The expected value of u_t is zero:
εu_t = 0        for all t
3. The variance σ_uu(t) is constant in time, and finite:
0 < σ_uu(t) = cov (u_t, u_t) = σ_uu < ∞        for all t
4. u_t is normal:
p(u) = (2π)^{-1/2} det (σ_uu)^{-1/2} exp [−½(u − εu)(σ_uu)^{-1}(u − εu)]
I explain this fancy notation in the next chapter. I use it because it
generalizes very handily into many dimensions. The usual way to
write the normal distribution is
p(u) = (σ_u √(2π))^{-1} exp [−(u − εu)² / 2σ_uu]
The symbol σ_uu is the square of σ_u; σ_uu is the variance of u.
5. u is not autocorrelated:
ε(u_t u_{t-θ}) = 0        for all t and for θ ≠ 0
6. u is fully independent of the variable Z:
cov (u_t, Z_{t-θ}) = 0        for all t and all θ
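A generating process obeying all six Simplifying Assumptions is easy to write down, which is one way to see what they jointly demand; every number below is invented:

```python
import random

random.seed(11)

T = 10_000
SIGMA = 2.0  # one constant standard deviation for every period (Assumption 3)

Z = [50.0 + 0.5 * t for t in range(T)]  # a predetermined variable

# Assumptions 1, 2, 4, 5: each u_t is an independent normal drawing with
# zero mean.  Assumption 6 holds because u is generated without any
# reference to Z at all.
u = [random.gauss(0, SIGMA) for _ in range(T)]

mean_u = sum(u) / T                 # near 0 (Assumption 2)
var_u = sum(x * x for x in u) / T   # near SIGMA**2 = 4 (Assumption 3)
```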
1.11. Interpretation of additivity
The random term u appears in model (1-2) as an additive term.
This fact rules out interaction effects between u and Z. Absence of
interaction effects means that, no matter what the level of income Z
may be, a random term of a given magnitude always has the same
effect on consumption. Its impact does not depend on the level of
income.
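The point can be put numerically: in the additive model a disturbance of a given size shifts C by that same amount whether income is low or high, whereas under a non-additive entry of u (a multiplicative one is used below, purely as an invented contrast) the shift depends on Z:

```python
ALPHA, GAMMA = 10.0, 0.8  # invented parameters
U_SHOCK = 2.0             # a disturbance of given magnitude

def effect_additive(Z):
    # Additive model C = ALPHA + GAMMA*Z + u: shift of C caused by the shock.
    return (ALPHA + GAMMA * Z + U_SHOCK) - (ALPHA + GAMMA * Z)

def effect_multiplicative(Z):
    # Non-additive contrast C = (ALPHA + GAMMA*Z) * (1 + u/100): the same
    # shock now interacts with the level of income.
    return (ALPHA + GAMMA * Z) * (1 + U_SHOCK / 100) - (ALPHA + GAMMA * Z)

shift_low, shift_high = effect_additive(50.0), effect_additive(500.0)
mult_low, mult_high = effect_multiplicative(50.0), effect_multiplicative(500.0)
```

The additive shifts are identical at both income levels; the multiplicative ones are not.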
1.12. Recapitulation
We must be very clear in econometrics, as well as in other areas of
statistical inference, about what is assumed, what is observed, and
what is guessed, and also about what criterion the guess should satisfy.
Table 1.1 provides a check list of things we accept by assumption,
things we can and cannot do, and things we must do in making
statistical estimates. The items in the first three columns have been
introduced in this chapter; the estimating criteria in the fourth column
will be discussed in the next chapter.
Digression on the differences among moment,
expectation, and covariance
Consider two variables, consumption c and yesterday's income
z. They may or may not be functionally related. They have a
Table 1.1

These things are assumed | These things are observed | These things are not observed | This is imposed | These things are computed by us

That a true α and a true γ exist | | The true α and γ | Some estimating criterion for computing α̃, γ̃, ũ | α̃, a guess as to α; γ̃, a guess as to γ

That a true u_t exists in each time period | | The true u_t | | The residuals ũ_s (s = 1, 2, . . . , S)

That u_t has the Six Simplifying Properties | | εu, the expected value; σ_uu, the variance of the error | | (Σ ũ_s)/S, the mean of the residuals ũ_s; m_ũũ, the moment of the residuals ũ_s

That there is a universe C_i, Z_i (i = 1, 2, . . . , I) in which C_i = α + γZ_i + u_i | The C's and Z's of the sample, denoted by C_s, Z_s (s = 1, 2, . . . , S) | The C's and Z's not in the sample; the rest of the universe U = (C_1, . . . , C_I; Z_1, . . . , Z_I) | |
Expectation
The average value of c in the universe is symbolized by εc (read
"expected value of c" or "expectation of c"). Similarly, εz is
the average z in the universe.
Covariance
The covariance of c and z is defined as the expected value
ε(c − εc)(z − εz)
where i runs over the entire universe. This is symbolized by
cov (c,z) or σ_cz.
Variance
The variance of c is simply the covariance of c and c. It is
written var c, or cov (c,c), or σ_cc.
Now consider a specific sample S⁰ made up of specific (corresponding) pairs of consumption and income, for instance,
(C_27, C_54, C_105; Z_27, Z_54, Z_105)
Let the sample means for this particular sample be written c⁰
and z⁰, respectively.
Moment
The moment (for sample S⁰) of c on z is defined as the expected
value
ε(c_s − c⁰)(z_s − z⁰)
where s runs over 27, 54, and 105 only. It is symbolized by
m_cz(S⁰) or simply m_cz. Of course, a different sample S¹
would give a different moment m_cz(S¹).
Expectation of a moment
Now consider all samples of size 3 that we can draw (with
replacement) from the universe U. Then the expectation of m C9
is the average of the various moments m e . M (S°), m c .,(S 1 ), etc.,
when all conceivable samples of size 3 are taken into account.
A universe with J elements generates f ) such samples, and the
means c and z of the two variables vary from sample to sample.
The expectation of m ct for samples of size 4 is, in general, a
different value altogether.
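These distinctions can be checked by brute force on a toy universe whose numbers are invented. The code enumerates every ordered size-3 sample drawn with replacement, computes each sample's moment m_cz, and averages over all of them:

```python
from itertools import product

# A tiny invented universe of (c, z) pairs.
universe = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]

def moment_cz(pairs):
    # The moment for one sample: average of (c_s - c0)(z_s - z0) over it.
    n = len(pairs)
    c0 = sum(c for c, _ in pairs) / n
    z0 = sum(z for _, z in pairs) / n
    return sum((c - c0) * (z - z0) for c, z in pairs) / n

# Every ordered sample of size 3, drawn with replacement: I**3 of them.
samples = list(product(universe, repeat=3))
expectation_of_m = sum(moment_cz(s) for s in samples) / len(samples)

# The covariance over the whole universe, for comparison.
sigma_cz = moment_cz(universe)
```

The exhaustive average comes out at (2/3)σ_cz, not σ_cz itself, which illustrates why the expectation of a moment need not equal the corresponding universe value.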
Much confusion will be avoided later on if these distinctions
are kept in mind. Clear up any questions by doing the exercises
below.
Exercises
1.A If c, c′, c″, c‴ are four independent drawings (with replacement) from the universe U, prove that ε(c′ + c″ + c‴) = 3 εc.
1.B If c, z, q are variables and k is a constant, which of the following relations are identities?
cov (c,z) = cov (z,c)
m_cz = m_zc
ε(c + z) = εc + εz
cov (kc,z) = k cov (c,z)
var (kz) = k² var z
m_(kc)z = k m_cz
cov (c + q, z) = cov (c,z) + cov (q,z)
Further readings
The art of model specification is learned by practice and by studying
cleverly contrived models. BEACH,¹ KLEIN, TINBERGEN, and TINTNER give
several examples. L. R. Klein and A. S. Goldberger, An Econometric Model
of the United States: 1929–1952 (Amsterdam: North-Holland Publishing
Company, 1957), present a celebrated large-scale econometric model. Chapters 1 and 2 give a good idea of the difficulties of estimation. The performance
of this model is appraised by Karl A. Fox in "Econometric Models of the
United States" (Journal of Political Economy, vol. 64, no. 2, pp. 128–143,
April, 1956) and by Carl Christ in "Aggregate Econometric Models" (American
Economic Review, vol. 46, no. 3, pp. 385–408, June, 1956).
"The Dynamics of the Onion Market," by Daniel B. Suits and Susumu
Koizumi (Journal of Farm Economics, vol. 38, no. 2, pp. 475–484, May, 1956),
is an interesting example of econometrics applied to a particular market in
the short run.
KENDALL, chap. 7, reviews the logic of probability, sampling, and expected
value. For a lucid discussion of the concept of randomness, see M. G.
Kendall, "A Theory of Randomness" (Biometrika, vol. 32, pt. 1, pp. 1–15,
January, 1941).
As far as I know, the assumptions about the random term have not been
discussed systematically from the economic point of view, except for Marschak's brief passage (pp. 12–15) in HOOD, chap. 1, sec. 7. See also Gerhard
Tintner, The Variate Difference Method (Cowles Commission Monograph 5,
pp. 4–5 and appendixes VI, VII, Bloomington, Indiana: Principia Press,
1940), and Tintner, "A Note on Economic Aspects of the Theory of Errors
in Time Series" (Quarterly Journal of Economics, vol. 53, no. 1, pp. 141–149,
November, 1938).
As defined in economics textbooks, the production function and the cost
function necessarily violate Assumption 2. In no instance (whether in the
universe, the population, or the sample) can the random disturbance exceed
zero in the production function and fall short of zero in the average cost
1 See Frequent References at front of book. Works of authors whose names
are capitalized are listed there.
function. All statistical studies of production and cost functions I know of
have implicitly used the assumption that εu = 0. The error is in the assumption of normality. See, for instance, Joel Dean, "Department Store Cost
Functions," in Studies in Mathematical Economics and Econometrics, in memory
of Henry Schultz, edited by Oscar Lange, Francis McIntyre, and Theodore O.
Yntema (p. 222, Chicago: University of Chicago Press, 1942), which is also an
interesting attempt to fit static cost functions to data from years of large
dynamic changes. In this respect I was guilty myself in "An Econometric
Model of Growth: U.S.A. 1869–1953" (American Economic Review, vol. 45,
no. 2, pp. 208–221, May, 1955).
For examples of nonadditive disturbances, see Hurwicz, "Systems with
Nonadditive Disturbances," chap. 18 of KOOPMANS, pp. 410–418.
CHAPTER 2
Estimating criteria and the method
of least squares
2.1. Outline of the chapter
This chapter, like the previous one, deals exclusively with single
equation models. Unless the contrary is stated, all the Simplifying
Assumptions of Sec. 1.9 remain in force. The main points of this
chapter are the following:
1. Once we have specified the model and made certain stochastic
assumptions, our sample tells us nothing about the unknown parame
ters of the model unless we adopt an estimating criterion.
2. A very reasonable (and hard to replace) criterion is maximum
likelihood. It is based on the assumption that, while we were taking
the sample, Nature performed for our benefit the most likely thing,
or generated for us her most probable sample.
3. Once the maximum likelihood criterion is adopted, we can tell
precisely what the unknown parameters must be if our sample was
the most likely to turn up. This is what is called maximizing the
likelihood function. We find the unknowns by manipulating this
function.
4. The familiar least squares fit arises as a special case of the oper
ation of maximizing the likelihood function.
5. In many cases, adopting the maximum likelihood criterion auto
matically generates estimates that conform to other estimating
criteria, for example unbiasedness, consistency, efficiency.
6. If estimates of the unknown parameters are unbiased, consistent,
etc., this does not mean that our particular sample or method has
given us a correct estimate. It means that, if we had infinite facilities
(or infinite patience), we could get a correct estimate "in the long run"
or "on the average."
7. The likelihood function not only tells us what values of the
parameters give the greatest probability to the observed event but
also attaches to such values degrees of credence, or reliability.
Though these statements can be made about all sorts of models,
the singleequation model of consumption that I have been using all
along captures the spirit of the procedure. Multiequation models
have all the complications of singleequation models plus many others.
2.2. Probability and likelihood
In common speech, probability and likelihood are but Latin and
Saxon doublets. In statistics the two terms, though often inter
changed for the sake of variety or style, have distinct meanings.
Probability is a property of the sample; likelihood is a property of the
unknown parameter values.
Probability
Imagine that, in a model that described Nature's workings perfectly,
the true values of the parameters a, /?, 7, . . . were such and such
and that the true stochastic properties of the error term u were such
and such. We would then say that certain types of natural behavior
(i.e., certain samples or observations) were more probable than others.
For example, if you knew that a river flowed gently southward at a
speed of 3 miles per hour, that an engineless boat drifting on it had
such and such dimensions, weight, and friction (the model); if, in
addition, you knew that gentle breezes usually blow in the area, very
rarely faster than 5 miles per hour, and that they usually blow now
in one, now in another direction (the stochastic properties) ; then you
would be very much surprised to find an instance in which the boat
had traveled 25 miles northward or 30 miles southward in the space
of 2 hours (the improbable behavior).
Likelihood
Now reverse the position. If you were sure of your information
about the wind, if you did not know which way or how fast the river
flowed, but you observed the boat 28 miles south of where it was
2 hours ago and were willing to assume that Nature took the most
probable action while you happened to be observing her, then you
would infer that the river must have a southward current of 14 miles
per hour. This is the maximum likelihood estimate, or most likely
(NOT most probable) speed of the river on the evidence of this
unfortunate sample. Any other southward speed and any kind of
northward flow are highly unlikely, or less likely than 14 miles per
hour south.
To say that any other speed is less probable is to misuse the term.
The river's speed is what it is (3 miles per hour to the south) and it
cannot be more or less probable. What can be more or less probable
is the particular observation: that the boat has traveled southward
28 miles. This observation is very improbable if the river indeed
flows 3 miles per hour southward. It would be more probable if the
river flowed southward with a speed of 5, 7, or 10 miles. And it
would be most probable if the true speed of the river had been 14 miles
per hour. Evidently, a maximum likelihood guess can be very far
from the truth.
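A toy version of this reasoning can be computed directly, with an assumed normal wind disturbance (the text's exercise 2.A uses a discrete one instead): for each candidate river speed γ, ask how probable the observed 28-mile southward drift would be, and keep the γ that makes it most probable.

```python
import math

OBSERVED_MILES, HOURS = 28.0, 2.0
WIND_SD = 5.0  # assumed spread of the net wind effect, in miles per hour

def log_likelihood(gamma):
    # Distance = gamma * HOURS + wind, with the wind assumed normal, mean 0.
    resid = OBSERVED_MILES - gamma * HOURS
    return -0.5 * (resid / (WIND_SD * HOURS)) ** 2 - math.log(WIND_SD * HOURS)

# Grid search over candidate river speeds (southward positive).
candidates = [g / 10 for g in range(-300, 301)]
gamma_ml = max(candidates, key=log_likelihood)  # 14.0: the most LIKELY speed
```

The estimate is 14 miles per hour south, exactly as in the text, even though the true speed in the story is 3: the criterion picks the parameter value that makes the observed sample most probable, nothing more.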
All estimation in econometrics operates as in this river example, no
matter how elaborate the model, sloppy or exquisite the sample.
What is so commendable about the maximum likelihood criterion,
if it cannot guarantee us correct or even nearly correct results? Why
assume that Nature will do the most likely thing? All I can say to
this is to ask: Well, what shall we assume instead? the second most
likely thing? the seventy-first?
It is true that (in some cases) maximum likelihood estimates tend
to be correct estimates "on the average" or "in the long run" (see
Secs. 2.1 and 2.10). These facts, however, are irrelevant, because we
26 ESTIMATING CRITERIA AND THE METHOD OF LEAST SQUARES
use the maximum likelihood criterion even when we plan neither to
repeat the experiment nor to enlarge the sample.
It is very important to appreciate just what maximum likelihood
estimation does: The experimenter makes one observation, say, that
the boat had traveled 28 miles southward in 2 hours; he then asserts
hopefully (he does not know this) that the wind has been calm, because
this is the most typical total net wind speed for all conceivable 2-hour
stretches; and so he lets his estimate of the speed be 14 miles.
Actually, we (who happen to know that the true speed is 3 miles)
realize that, while the experimenter was busy measuring, the weather
was not at all typical but happened to be the improbable case of
2 hours of strong southerly wind.
The same experimenter under different circumstances might estimate
the speed to be 6, 0.5, −2, −3.0, etc., miles per hour depending on the
wind's actual whim during the 2-hour interval in which observation
took place.
Exercise
2.A Set up an econometric model of the river-and-boat example
of Sec. 2.2, using the following symbols: d_t for the number of miles
(from a fixed point) traveled southward in t hours by the boat, γ for the
(unknown) speed of the river in miles per hour, and u_t for the net
southbound component of the wind's speed in miles per hour. Let u_t
have the following stochastic specification:

10 per cent of the time u_t = 11 (southbound)
70 per cent of the time u_t = 0 (calm)
10 per cent of the time u_t = −5 (northbound)
10 per cent of the time u_t = −6 (northbound)

Construct a probability table giving the net wind effects for 2 hours in
succession. For each type of conceivable observation, derive the
maximum likelihood estimate of γ.
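The probability table asked for in Exercise 2.A can be built up mechanically. The following sketch (Python, with variable names of our own choosing, not the text's) convolves the one-hour wind distribution with itself and reads off the maximum likelihood estimate of γ from an observed 2-hour drift:

```python
from itertools import product

# Stochastic specification of the southbound wind component u_t
# from Exercise 2.A: value -> probability.
wind = {11: 0.10, 0: 0.70, -5: 0.10, -6: 0.10}

# Probability table for the net wind effect over 2 successive hours:
# convolve the one-hour distribution with itself.
two_hour = {}
for (u1, p1), (u2, p2) in product(wind.items(), wind.items()):
    two_hour[u1 + u2] = two_hour.get(u1 + u2, 0.0) + p1 * p2

# The boat's southward drift in 2 hours is d2 = 2*gamma + (u1 + u2).
# Maximum likelihood treats the observed d2 as if the wind total took
# its most probable value, the mode of the table (calm-calm, 0.49).
mode = max(two_hour, key=two_hour.get)

def gamma_hat(d2):
    """ML estimate of the river speed from one 2-hour observation."""
    return (d2 - mode) / 2

print(sorted(two_hour.items()))
print(gamma_hat(28))
```

Since the mode of the 2-hour wind total is 0, every observation d2 yields γ̂ = d2/2; the 28-miles-south observation of the text gives 14 miles per hour.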
Digression on the multivariate normal distribution
The univariate normal distribution for a variable u with
universe mean εu and variance σ_uu was written in Sec. 1.10 in the
fancy form

p(u) = (2π)^(−1/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-1)

because of the ease with which it generalizes to the multivariate
case.
Let u1, u2, . . . , uN be N variables which have a joint normal
distribution. Define

u = vec (u1, u2, . . . , uN)
εu = vec (εu1, εu2, . . . , εuN)

σ_uu =  [ σ_u1u1   · · ·   σ_u1uN ]
        [  · · ·            · · · ]
        [ σ_uNu1   · · ·   σ_uNuN ]
For σ_u1u2 we often write cov (u1,u2), or simply σ12 if the meaning
is clear from the context. Sometimes the inverse of σ_uu, usually
written (σ_uu)^(−1), is written σ^uu, and its elements are written σ^(u_m u_n)
or just σ^mn. These superscripts are not exponents. If we need
to write an exponent we write it outside parentheses, as in equation (2-1).
To get the multivariate distribution for u1, u2, . . . , uN, all
we need to do is change the italic u's of (2-1) into bold characters:

p(u) = (2π)^(−N/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-2)
This illustrates the principle noted in Sec. 1.2: that if an operation,
theorem, property, etc., holds for simple numbers, it holds
analogously for matrices. This is a great convenience, because
you can pretend that matrices are numbers and so collapse a
complicated formula into a shorter and more intuitive expression.
Moreover, by pretending a matrix is a number, you can get a
clear impression of what a formula conveys.
Exercises
2.B Write explicitly the joint normal distribution of the two
variables x and w.
2.C In Exercise 2.B, modify the formula for σ_xx = σ_ww and σ_xw = 0.
2.D Write in vector and matrix notation the formula

−½ Σ_{m=1}^{N} Σ_{n=1}^{N} (u_m − εu_m) σ^mn (u_n − εu_n)
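For N = 2 (the case of Exercises 2.B and 2.C), formula (2-2) can be written out and checked by machine. The following sketch (Python; the numerical means and variances are invented for illustration) verifies that with σ_xw = 0 the joint density factors into the product of two univariate densities of the form (2-1):

```python
import math

def mvn2_density(x, w, mx, mw, sxx, sww, sxw):
    """Bivariate case of (2-2): (2*pi)^(-N/2) * det(S)^(-1/2)
    * exp(-1/2 (u - m)' S^(-1) (u - m)), written out for N = 2."""
    det = sxx * sww - sxw * sxw
    # inverse of the 2x2 moment matrix
    ixx, iww, ixw = sww / det, sxx / det, -sxw / det
    dx, dw = x - mx, w - mw
    quad = ixx * dx * dx + 2 * ixw * dx * dw + iww * dw * dw
    return (2 * math.pi) ** (-1) * det ** (-0.5) * math.exp(-0.5 * quad)

def uvn_density(u, m, s):
    """Univariate normal, the N = 1 case of (2-1); s is the variance."""
    return (2 * math.pi * s) ** (-0.5) * math.exp(-0.5 * (u - m) ** 2 / s)

# Exercise 2.C in action: with s_xw = 0 the joint density is the
# product of the two marginals.
joint = mvn2_density(1.0, -0.5, 0.0, 0.0, 2.0, 3.0, 0.0)
prod = uvn_density(1.0, 0.0, 2.0) * uvn_density(-0.5, 0.0, 3.0)
print(joint, prod)
```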
2.3. The concept of likelihood function
Consider again the river-and-boat illustration of the previous section.
Our information can come in one of several different ways.
1. A sample of one observation
Someone may have sighted the boat at the zero point at twelve
o'clock and 28 miles south of that 2 hours later. This is one observation;
it leads to 28/2 = 14 miles per hour southward as the maximum
likelihood estimate of the river's speed. The number of hours
elapsed from the beginning to the end of the observation could have
been 1 or ½ or 7 or anything else.
2. A sample of several independent observations
We may have several observations like the above but made on
different days. For instance,
Observation   Time elapsed   Distance traveled
1             2 hours        28 miles south
2             4 hours        12 miles south
3             17 hours       44 miles south
3. Several interdependent observations
Observations may overlap; as, for example,

Observation   Time of observation           Distance traveled
a             12 to 2 p.m.                  28 miles south
b             1 to 5 p.m. of the same day   20 miles south
Or information may come in even more complicated ways. The likelihood
function can be constructed only if we know both the circumstances
of our observations and the readings derived from them.
Cases 1, 2, and 3 lead to different likelihood functions because the
circumstances differ. Two observers, each of whom watched the boat
for 2 consecutive hours unbeknownst to the other, would set up two
likelihood functions identical in form into which they would feed
different readings. But each investigator would set up one and only
one likelihood function. This is a function of a single sample, the
sample, his sample; no matter how independent, complicated, or
interdependent his observations may be, they form a single sample.
The maximum likelihood criterion tells us to proceed as if Nature
did the most probable thing. We assert this about the totality of
observations in the sample rather than about any single observation.
2.4. The form of the likelihood function
Return to the consumption model C_t = α + γZ_t + u_t. The following
statement must be accepted on faith (its proof is a deepish
theorem in analysis): Under the assumptions that Nature conforms
to the model and that the true values of the parameters are α and γ,
the probability of observing the particular sample C1, C2, . . . , C_S,
Z1, Z2, . . . , Z_S is equal to the probability that the error term shall
have assumed the particular values u1, u2, . . . , u_S multiplied by a
factor det J.
The term det J happens to be equal to 1 in all single-equation
cases; so we need not worry about it yet. It becomes important in
two-or-more-equation models.
The statement cited above is of immense and curious significance.
We observe the sample C_t, Z_t. But we cannot know directly how
probable or improbable it is to obtain this particular sample, since
all our stochastic assumptions have to do with the probability distribution
of the u's, not of the C's and Z's. On the other hand, we can
never observe the random errors themselves. So one might despair
of finding the probability of this particular sample but for the remarkable
property cited. Let L stand for the probability of the sample
and q for the probability that the random term will take on the values
u1, u2, . . . , u_S. Then we have

L = det J · q(u1, u2, . . . , u_S)   (2-3)
Now, the (unobservable) u's are functions of the (observed) C's
and Z's and of (the unknown) α and γ, because the model implies
u_t = C_t − α − γZ_t. To maximize likelihood is to seek the pair of
values of α and γ that makes L as large as possible.
What form q(u1, . . . , u_S) takes depends on the stochastic assumptions
about the error term.
This concludes the discussion of the logic behind maximum likelihood
estimating of α and γ.
In the next few pages I discuss the mechanism of maximizing L under
the Six Simplifying Assumptions. On a first reading, you might skip
the rest of this section without serious loss. Readers who wish to
refresh their manipulative skills, read on! We shall now omit writing
det J = 1, since we are discussing only single-equation cases at this
point.
By Simplifying Assumption 4, the random terms u1, u2, . . . , u_S
come from a multivariate normal distribution. Therefore (2-2)
applies, and

L = q(u1, u2, . . . , u_S) = (2π)^(−S/2) det (σ_uu)^(−1/2) exp [−½(u − εu)(σ_uu)^(−1)(u − εu)]   (2-4)

By Simplifying Assumption 2, εu_t = 0. Simplifying Assumption 3
states that all diagonal elements of σ_uu are equal to a finite constant
σ_uu, and Assumption 5 states that all nondiagonal elements are zero;
so

det (σ_uu)^(−1/2) = (σ_uu)^(−S/2)

Therefore, (2-4) reduces to

L = (2π)^(−S/2) (σ_uu)^(−S/2) exp [−½(σ_uu)^(−1) Σ_{t=1}^{S} u_t²]   (2-5)
The following properties of L will not be proved:
1. L is a continuous function with respect to α, γ, σ_uu except at
σ_uu = 0. This means that it can be differentiated quite safely. As
for the exception, we need not worry about it; for u, as a random
variable, assumes at least two distinct values in the universe, and
therefore σ_uu > 0. If the sample is of only two observations, the fit
is perfect; m_uu is zero — but in that case we do not use the likelihood
approach at all. We just solve the two equations C1 = α + γZ1 and
C2 = α + γZ2 for the two unknowns α and γ.
2. Setting the partial derivatives of L equal to zero locates its
maxima. It has no minimum; therefore, we do not need to worry
about second-order conditions of maximization.
3. L is a maximum when its logarithm is a maximum. So, instead
of (2-5), we maximize the more convenient expression

log L = −(S/2) log 2π − (S/2) log σ_uu − ½(σ_uu)^(−1) Σ_{t=1}^{S} u_t²   (2-6)
4. The true values of α, γ, and σ_uu are not functions of one another,
but constants. Therefore, in maximizing, all partial derivatives of α,
γ, and σ_uu with respect to one another are zero.
Maximizing (2-6) results in

Σ_t (C_t − α − γZ_t) = 0
Σ_t (C_t − α − γZ_t)Z_t = 0                  (2-7)
(1/S) Σ_t (C_t − α − γZ_t)² = σ_uu

The solution of (2-7) for α, γ, σ_uu gives the maximum likelihood
estimates α̂, γ̂, σ̂_uu.
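The three equations of (2-7) can be checked numerically. The sketch below (Python; the data are hypothetical, invented only to exercise the formulas) solves the first two equations for α̂ and γ̂, forms σ̂_uu from the third, and verifies that the log-likelihood (2-6) is highest at that solution:

```python
import math

# A small artificial sample (C_t, Z_t) -- hypothetical data.
C = [5.0, 6.1, 7.9, 9.2, 10.8]
Z = [2.0, 4.0, 6.0, 8.0, 10.0]
S = len(C)

# Solve the first two equations of (2-7) for alpha-hat and gamma-hat.
sZ, sC = sum(Z), sum(C)
sZZ = sum(z * z for z in Z)
sCZ = sum(c * z for c, z in zip(C, Z))
gamma = (S * sCZ - sC * sZ) / (S * sZZ - sZ * sZ)
alpha = (sC - gamma * sZ) / S

# The third equation of (2-7): sigma-hat_uu is the average squared residual.
resid = [c - alpha - gamma * z for c, z in zip(C, Z)]
sigma_uu = sum(u * u for u in resid) / S

def log_L(a, g, s_uu):
    """Log-likelihood (2-6) under the Six Simplifying Assumptions."""
    rss = sum((c - a - g * z) ** 2 for c, z in zip(C, Z))
    return (-S / 2 * math.log(2 * math.pi)
            - S / 2 * math.log(s_uu)
            - rss / (2 * s_uu))

best = log_L(alpha, gamma, sigma_uu)
print(alpha, gamma, sigma_uu, best)
```

Perturbing any of the three parameters away from the (2-7) solution lowers the log-likelihood, as the first-order conditions require.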
2.5. Justification of the least squares technique
It is evident that (2-6) gives least squares estimates for α, γ, σ_uu.
System (2-7) says that the maximum likelihood values of α and γ
are the values that minimize the sum of the squares of the residuals û_t.
The last equation in (2-7) states that the maximum likelihood estimate
σ̂_uu of the true variance σ_uu is the average square residual.
This, then, is the justification for minimizing squares. Remember
that to get this result we had to make use of a great many assumptions
both about the model itself and about the nature of its error term.
If any one of these many assumptions had not been granted, we
might not have reached this result. Therefore, one should not go
about minimizing squares too lightheartedly. For every different set
of assumptions a certain estimating procedure is best, and least
squares is best only with a proper combination of assumptions. Conversely,
every estimating procedure contains in itself (implicitly) some
assumptions either about the model, or about the distribution of u,
or both. 1
Digression on computational arrangement

It pays to develop a tidy scheme for computing α̂, γ̂, and σ̂_uu,
because computation recipes similar to (2-7) turn up pretty often.
It is always possible to arrange the computations in such a
way as to estimate the coefficient γ of the independent variable
first. With γ̂ in hand, one computes the constant term α̂.
Finally, with α̂ and γ̂, one computes the residuals û_t and, from
these residuals, an estimate of σ_uu.
An analogous procedure for models having several independent
variables (and, hence, several γ's) is developed in the next
digression.
In all cases, that is to say, for simple as well as for complicated
models, I shall describe only the computational steps for estimating
the γ's (coefficients of the independent variables).
Write (2-7) as follows:

αS + γΣZ_t = ΣC_t
αΣZ_t + γΣZ_t² = ΣC_tZ_t

where the sums run over the entire sample. Now subtract ΣZ
times the first equation from S times the second. The result is

γ[SΣZ² − (ΣZ)(ΣZ)] = [SΣCZ − (ΣC)(ΣZ)]   (2-8)

Note that we have eliminated α and that, moreover, in the
square brackets we may recognize the familiar moments, defined in
Chap. 1. Thus (2-8) is equivalent to
1 The Six Simplifying Assumptions are sufficient but not necessary conditions
for least squares. Least squares is a "best linear unbiased estimator" under
much simpler conditions. This, however, is another subject. I chose these
particular six assumptions because with them it is easy to show how a stochastic
specification and an estimating criterion lead to a specific estimate of a parameter
rather than to some other estimate.
γ m_zz = m_zc

and the estimate of γ can be expressed very simply as

γ̂ = (m_zz)^(−1) m_zc   or   m_zc/m_zz   (2-9)

which, besides being compact, generalizes easily to N dimensions,
i.e., by replacing the Greek and italic letters by the corresponding
characters in boldface type:

γ̂ = (m_zz)^(−1) m_zc   (2-9a)
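The digression's computing scheme can be written out mechanically. A Python sketch (variable names ours; the data are hypothetical): γ̂ first by (2-9), then the constant term, then the residuals and σ̂_uu:

```python
def moment(x, y):
    """Sample moment m_xy = (1/S)*sum(x*y) - mean(x)*mean(y)."""
    S = len(x)
    return sum(a * b for a, b in zip(x, y)) / S - (sum(x) / S) * (sum(y) / S)

def fit_by_moments(C, Z):
    """The digression's order of computation: gamma-hat first via
    (2-9), then alpha-hat, then residuals and sigma-hat_uu."""
    S = len(C)
    gamma = moment(Z, C) / moment(Z, Z)          # (2-9)
    alpha = sum(C) / S - gamma * sum(Z) / S      # constant term next
    resid = [c - alpha - gamma * z for c, z in zip(C, Z)]
    sigma_uu = sum(u * u for u in resid) / S     # average squared residual
    return alpha, gamma, sigma_uu

# Hypothetical data, for illustration only.
C = [5.0, 6.1, 7.9, 9.2, 10.8]
Z = [2.0, 4.0, 6.0, 8.0, 10.0]
print(fit_by_moments(C, Z))
```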
2.6. Generalized least squares
All the principles discussed so far apply to all linear models consisting
of a single equation. To treat the general case, we shall make a slight
change in notation: y will stand for any endogenous variable (the role
played by consumption C so far) and z for any exogenous variable
(the role played by lagged income Z so far).
Let us suppose that the endogenous variable y(t) depends on H
different predetermined variables z1(t), z2(t), . . . , z_H(t) as follows
(omitting time indexes):

y = α + γ1z1 + γ2z2 + · · · + γ_H z_H + u   (2-10)

Indeed, the analogy of (2-10) with C = α + γZ + u is so perfect
that everything said about the latter applies to the shorthand edition
of (2-10):

y = α + γz + u

where γ is the vector (γ1, γ2, . . . , γ_H) and z is the vector 1 (z1, z2,
. . . , z_H).
But we must be careful. The first five Simplifying Assumptions
(u is random, has mean zero, has constant variance, is normal, and is
serially independent) need no alteration. Assumption 6 must, however,
be changed to read as follows: The error term u_t is fully independent
of z1(t), z2(t), . . . , z_H(t).
Under the new version of the Simplifying Assumptions, the maximum
likelihood criterion leads to the estimate of γ1, γ2, . . . , γ_H
that minimizes the sum of the squared residuals. And, moreover,
these estimates are given by the formula

γ̂ = (m_zz)^(−1) m_zy   (2-11)

which is exactly analogous to (2-9). What these boldface symbols
mean is explained in the next digression.
1 For typographical simplicity I shall not bother, in obvious cases, to distinguish
a column from a row vector. In this case z is a column vector.
Digression on matrices of moments and their determinants
This is a natural place to introduce some extremely convenient
notation, which we shall be using from Chap. 6 on.
If p, q, r, x, y are variables, m_(p,q,r)·(x,y) is the matrix whose
elements are moments that can be constructed with p, q, r on x
and y. The variables in the first parentheses correspond to the
rows and those in the second to the columns. Thus,

m_(p,q,r)·(x,y) =  [ m_px   m_py ]
                   [ m_qx   m_qy ]
                   [ m_rx   m_ry ]
The middle dot in the subscript may be omitted.
Likewise, m_zz means the matrix whose elements are moments of
the variables z1, z2, . . . , z_H on themselves:

m_zz =  [ m_z1z1   · · ·   m_z1zH ]
        [  · · ·            · · · ]
        [ m_zHz1   · · ·   m_zHzH ]

and m_zy means

m_zy =  [ m_z1y ]
        [  · · · ]
        [ m_zHy ]
Every square matrix has a determinant. So does every square
matrix of moments, for instance, m_zz; for the determinant of m
we write det m, perhaps with the appropriate subscripts det m_zz,
or det m_(z1,z2, . . . ,zH)(z1,z2, . . . ,zH).
But it is simpler to write m_zz instead of det m or det m_zz;
and we shall do this for compactness. The lightface italic m in
the expression m_zz indicates that the determinant is a simple
number, like 2 or 16.17, and neither a vector nor a matrix of
numbers (these are printed bold).
One way to estimate the coefficients γ1, γ2, . . . , γ_H is to
perform the matrix operations given in (2-11). Another way
is by Cramer's rule, which calculates various determinants and
computes

γ̂1 = m_(y,z2, . . . ,zH)(z1,z2, . . . ,zH) / m_(z1,z2, . . . ,zH)(z1,z2, . . . ,zH)

γ̂2 = m_(z1,y, . . . ,zH)(z1,z2, . . . ,zH) / m_(z1,z2, . . . ,zH)(z1,z2, . . . ,zH)

· · · · ·

γ̂H = m_(z1,z2, . . . ,y)(z1,z2, . . . ,zH) / m_(z1,z2, . . . ,zH)(z1,z2, . . . ,zH)   (2-12)
Both these ways are very cumbersome in practice for equations
with more than three or four variables, unless we have ready
programs on electronic computers. Appendix B gives a stepwise
technique for calculating γ̂1, γ̂2, . . . , γ̂_H that can be used on
an ordinary desk calculator.
Matrix inversion is discussed in Appendix A.
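For H = 2 the Cramer's-rule recipe of (2-12) is short enough to spell out. A Python sketch with invented data (the variable names are ours; no computer program of the period is implied):

```python
def moment(x, y):
    """Sample moment m_xy = (1/S)*sum(x*y) - mean(x)*mean(y)."""
    S = len(x)
    return sum(a * b for a, b in zip(x, y)) / S - sum(x) * sum(y) / S ** 2

def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

# Hypothetical data with H = 2 predetermined variables.
y  = [3.0, 4.5, 5.0, 7.5, 9.0, 9.5]
z1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z2 = [2.0, 1.0, 3.0, 2.0, 4.0, 3.0]

m11, m12, m22 = moment(z1, z1), moment(z1, z2), moment(z2, z2)
m1y, m2y = moment(z1, y), moment(z2, y)

# Cramer's rule, as in (2-12): replace one column of m_zz by m_zy
# and divide by det m_zz.
D = det2(m11, m12, m12, m22)          # det m_zz
g1 = det2(m1y, m12, m2y, m22) / D
g2 = det2(m11, m1y, m12, m2y) / D
print(g1, g2)
```

The pair (g1, g2) also solves the normal equations m_zz · γ̂ = m_zy, which is the content of (2-11).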
2.7. The meaning of unbiasedness
Let us discuss bias and unbiasedness by using the original model of
consumption C_t = α + γZ_t + u_t with the understanding that all
conclusions hold true for the generalized single-equation model
y = α + γ1z1 + · · · + γ_H z_H + u. Furthermore, we can restrict ourselves,
with some exceptions, to the discussion of γ, because the
statements to be made are also true of α.
Imagine that we obtain our guess γ̂ of the parameter γ, violating
none of the Simplifying Assumptions. The guess so chosen is the
most likely in the circumstances. But this does not guarantee it to
be equal to the true value γ. This is so because the observations
C_t, Z_t we have to go on are just a sample. And in sampling anything
can happen. Extremely atypical misleading samples are improbable
but perfectly possible. So it makes sense to ask how far off the guess
γ̂ is likely to be from the true value γ.
Here it is very important to distinguish between (1) taking again
and again a sample of size S, (2) taking bigger and bigger samples
(one of each size). The first procedure is connected with the important
statistical notion of bias, the second with that of consistency. Both
procedures are ideal and impractical, because such samples must be
taken from the universe (level III) and not merely from the population
(level II). Therefore, even with infinite resources and infinite
patience, the concepts are not operational.
Consider any estimating recipe (say, least squares). Choose a
sample size, say, S = 20; draw (from the universe) all possible samples
of size 20; for each sample compute (by least squares) the corresponding
γ̂; then average out these γ̂'s. If the average γ̂ equals the true γ,
then we say that the procedure of least squares is an unbiased method
for estimating γ, or an unbiased estimator of γ, for sample size S = 20.
Loosely, we might say that on the average, least squares gives a correct
estimate of γ from samples of 20 observations.
In order to pin down firmly the concept of bias, I have constructed
a purposely simple and exaggerated example. It involves just three
time periods, very uneven disturbances from time period to time
period, and a random disturbance that assumes just three different
values. Yet this example illustrates all that could be shown with a
larger and more realistic one.
Assume that the true values of the parameters we seek to estimate
are α = 4, γ = 0.4. Assume that the population consists of exactly
3 elements, labeled a, b, c, whose coordinates are given in Table 2.1
below; the three points are shown in Fig. 2. This population could
have come from an infinite universe, but let us (for pedagogic reasons)
deal with a finite universe that consists of the above three points
a, b, c plus four more, which are named a', b', c', and c''. Every point
of the universe is completely defined when we specify the random
Fig. 2. A seven-point universe. Solid dots: points in the population. Hollow
dots: points in the universe but not in the population. (Axes: income Z_t
horizontal, consumption C_t vertical; the line C_t = 4 + 0.4Z_t marks the true
exact relation; points c' and c'' coincide.)

Table 2.1
The population (a, b, c)

Point   Time t   Z_t   C_t    u_t
a       1         0    4.05    0.05
b       2         4    6.00    0.40
c       3        14    0.60   −9.00
error u that corresponds to it and the level of the independent variable.
These are given in Table 2.2.
Table 2.2

        Points in the population          Points in the universe but not in the population
Time    Name   Value of u_t   Value of Z_t    Name   Value of u_t   Value of Z_t
1       a       0.05           0              a'     −0.05           0
2       b       0.40           4              b'     −0.40           4
3       c      −9.00          14              c'     +4.50          14
3       ...     ...           ...             c''    +4.50          14
Exercise
2.E Which Simplifying Assumptions are fulfilled by u_t in the
universe of Table 2.2, and which are violated?
Now let us see if the least squares method is an unbiased estimator
of γ. First let us take all conceivable samples of size 2 and for each
compute the least squares value γ̂. Samples should be taken in such a
way that the same time period is not represented more than once.
The population can yield only the following pairs: (a,b), (a,c), and
(b,c). This is the most that a flesh-and-blood statistician, even one
equipped with unlimited means, could obtain operationally, because
points a', b', c', and c'' exist, so to speak, only in the mind of God.
But the definition of bias requires us to check samples (of size 2) that
include all points of the universe, human and divine alike. There are
sixteen such samples, and the corresponding estimates α̂ and γ̂ are
given in Table 2.3 and plotted in Fig. 3. When all sixteen are considered,
it is seen that least squares is an unbiased estimator of γ
(and of α).
Table 2.3
Estimates of α and γ from samples of size 2

Points in the sample      Estimate of γ   Estimate of α
a  b                       0.4875          4.0500
a  b'                      0.2875          4.0500
a  c                      −0.2464          4.0498
a  c'                      0.7179          4.0497
a  c''                     0.7179          4.0497
a' b                       0.5125          3.9500
a' b'                      0.3125          3.9500
a' c                      −0.2393          3.9501
a' c'                      0.7250          3.9500
a' c''                     0.7250          3.9500
b  c                      −0.5400          8.1600
b  c'                      0.8100          2.7600
b  c''                     0.8100          2.7600
b' c                      −0.4600          7.0400
b' c'                      0.8900          1.6400
b' c''                     0.8900          1.6400

Average of all conceivable samples:        εγ̂ = 0.4000   εα̂ = 4.0000
Average of all feasible samples
(unprimed points):                         −0.0997        5.4199
If we try all samples of size 3, we get the results tabulated in Table 2.4
and plotted in Fig. 4.
Table 2.4
Estimates of α and γ from samples of size 3

Points in the sample      α̂           γ̂
a  b  c                   5.36728     −0.30288
a  b  c'                  3.63652      0.73558
a  b  c''                 3.63652      0.73558
a  b' c                   5.00833     −0.28750
a  b' c'                  3.27757      0.75096
a  b' c''                 3.27757      0.75096
a' b  c                   5.29939     −0.29712
a' b  c'                  3.56857      0.74135
a' b  c''                 3.56857      0.74135
a' b' c                   4.94038     −0.28173
a' b' c'                  3.20962      0.75673
a' b' c''                 3.20962      0.75673

Average                   εα̂ = 4.0000   εγ̂ = 0.4000
For a sample of size 3, the least squares method is an unbiased estimator
of both γ and α.
In certain cases, not illustrated by our simple example, (1) an
estimating technique (say, least squares) may be unbiased for some
sample sizes and biased for other sizes; (2) a method may overestimate
γ for certain sample sizes and underestimate it for others, on the
average; (3) we may be able to tell a priori, knowing the sample size S,
whether the bias is positive or negative (in other cases we cannot);
(4) a method may be unbiased for one parameter but biased for another.
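The enumeration behind Tables 2.3 and 2.4 is mechanical and can be verified by machine. A Python sketch (ours, not the text's) that rebuilds the seven-point universe of Table 2.2 and averages the least squares estimates over all conceivable samples:

```python
from itertools import product

alpha_true, gamma_true = 4.0, 0.4

# The seven-point universe of Table 2.2: time period -> (name, u_t, Z_t).
universe = {
    1: [("a", 0.05, 0.0), ("a'", -0.05, 0.0)],
    2: [("b", 0.40, 4.0), ("b'", -0.40, 4.0)],
    3: [("c", -9.00, 14.0), ("c'", 4.50, 14.0), ("c''", 4.50, 14.0)],
}

def point(u, Z):
    """(Z, C) coordinates of a universe point: C = 4 + 0.4*Z + u."""
    return Z, alpha_true + gamma_true * Z + u

def ls_fit(pts):
    """Least squares (alpha-hat, gamma-hat) for a list of (Z, C) points."""
    S = len(pts)
    Zs = [z for z, _ in pts]
    Cs = [c for _, c in pts]
    mzz = sum(z * z for z in Zs) / S - (sum(Zs) / S) ** 2
    mzc = sum(z * c for z, c in zip(Zs, Cs)) / S - sum(Zs) * sum(Cs) / S ** 2
    g = mzc / mzz
    return sum(Cs) / S - g * sum(Zs) / S, g

# All sixteen samples of size 2: one point from each of two distinct times.
size2 = [ls_fit([point(u1, z1), point(u2, z2)])
         for t1, t2 in [(1, 2), (1, 3), (2, 3)]
         for (_, u1, z1), (_, u2, z2) in product(universe[t1], universe[t2])]

# All twelve samples of size 3: one point from each time period.
size3 = [ls_fit([point(u, z) for _, u, z in triple])
         for triple in product(universe[1], universe[2], universe[3])]

avg_a2 = sum(a for a, _ in size2) / len(size2)
avg_g2 = sum(g for _, g in size2) / len(size2)
avg_g3 = sum(g for _, g in size3) / len(size3)
print(len(size2), len(size3), avg_a2, avg_g2, avg_g3)
```

The averages come out at exactly εγ̂ = 0.4 and εα̂ = 4, reproducing the last lines of Tables 2.3 and 2.4.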
2.8. Variance of the estimate
In Fig. 3 I have plotted all the estimates of α and γ for all possible
samples of size 2. The same thing was done for size 3 in Fig. 4. In
general, the estimates are scattered or clustered, depending (1) on the
size S of the sample, (2) on the size and other features of the universe,
(3) on the particular estimating technique we have adopted, and
ultimately (4) on the extent to which random effects dominate the
systematic variables. Other things being equal, we prefer an estimating
technique that yields clustered estimates. The spread among
the various estimates γ̂ is called the variance of the estimate γ̂, and is
written σ_γ̂γ̂ or σ(γ̂,γ̂), or, sometimes, σ(γ̂,γ̂|S) if we want to emphasize
what size sample it relates to.
The variance is defined by

σ(γ̂,γ̂) = ε(γ̂ − εγ̂)²

and is a constant, which exists and can be computed if the four items
listed above are known. Table 2.5 gives the values of σ(γ̂,γ̂) for our
Fig. 3. Parameter estimates from all samples of size 2. φ: double point.
(Axes: α̂ horizontal, γ̂ vertical; the true point (α,γ) = (4, 0.4) is marked.)

Fig. 4. Parameter estimates from all samples of size 3. φ: double point.
(Axes: α̂ horizontal, γ̂ vertical; the true point (α,γ) = (4, 0.4) is marked.)
seven-point example. Note the interesting (and counterintuitive)
fact that the variance of the estimate can increase as the sample size
increases! This quirk arises because, in the example, the random
disturbance has a skew distribution. If u is symmetrical, the variance
of the estimate decreases as the sample size increases.
Table 2.5

Size S of sample   σ(γ̂,γ̂)
2                  0.2325
3                  0.2397
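The two entries of Table 2.5 can be checked by brute force. A Python sketch (ours) that pools all size-2 samples and all size-3 samples of the seven-point universe and computes the variance of γ̂ in each family:

```python
from itertools import product

# Seven-point universe of Table 2.2 as (u_t, Z_t); C = 4 + 0.4*Z + u.
universe = {1: [(0.05, 0.0), (-0.05, 0.0)],
            2: [(0.40, 4.0), (-0.40, 4.0)],
            3: [(-9.00, 14.0), (4.50, 14.0), (4.50, 14.0)]}

def gamma_hat(pts):
    """Least squares slope through a list of (Z, C) points."""
    S = len(pts)
    Zs = [z for z, _ in pts]
    Cs = [c for _, c in pts]
    mzz = sum(z * z for z in Zs) / S - (sum(Zs) / S) ** 2
    mzc = sum(z * c for z, c in zip(Zs, Cs)) / S - sum(Zs) * sum(Cs) / S ** 2
    return mzc / mzz

def samples(times):
    """All samples drawing one universe point per listed time period."""
    for combo in product(*(universe[t] for t in times)):
        yield [(Z, 4.0 + 0.4 * Z + u) for u, Z in combo]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

gs2 = [gamma_hat(s) for ts in [(1, 2), (1, 3), (2, 3)] for s in samples(ts)]
gs3 = [gamma_hat(s) for s in samples((1, 2, 3))]
print(round(variance(gs2), 4), round(variance(gs3), 4))
```

Rounded to four decimals the two variances agree with Table 2.5, including the quirk that the size-3 variance exceeds the size-2 variance.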
2.9. Estimates of the variance of the estimate
If we have complete knowledge, we can compute the true value of
σ(γ̂,γ̂|S) by making a complete list of all samples of size S, computing
all possible estimates of γ, and finding their variance, as I did in the
above example. In practice, however, it is impossible to exhaust all
samples of a given size, because the universe contains points that are
not in the population. So, instead, we must be content with guessing
at the variance of the estimate by the use of whatever information is
contained in the single sample we have already drawn.
At first, you might suppose that estimating σ(γ̂,γ̂|S) is logically
impossible when you have a single sample of size S to work with,
because, after all, the variance of the estimate of γ represents what
happens to γ̂ as you take all samples of size S.
All is not lost, however, because a single sample of size S contains
several samples (S of them) each of size S − 1. The latter we can
generate by leaving out, one at a time, each observation of the original
sample. Thus, if the original sample is (a,b,c) of size S = 3, it contains
three subsamples of size 2 each, the following ones: (a,b), (a,c),
and (b,c), which yield, respectively, the three estimates γ̂(a,b), γ̂(a,c),
and γ̂(b,c). We get, then, some idea about variations in the estimate
of γ among samples of size 2. Still, we know nothing about the variance
of γ̂ as estimated from samples of size 3. Here we invoke the maximum
likelihood criterion. The original sample (a,b,c) was assumed to be
the most probable of its kind, namely, the family of samples containing
three observations each. If this is so, then observations a, b, c
generate the most probable triplet T = {(a,b),(a,c),(b,c)} of samples
containing two observations each. Therefore, the variability of γ̂
(in the triplet T) estimates its variability in samples of size 3.
From Table 2.3,

γ̂(a,b) = 0.4875
γ̂(a,c) = −0.2464
γ̂(b,c) = −0.5400
Average = −0.0997

The variance of γ̂ in the sample triplet is equal to

⅓[(0.4875 + 0.0997)² + (−0.2464 + 0.0997)² + (−0.5400 + 0.0997)²] = 0.1867
The last figure must now be corrected by the factor S − 1 = 2 if it
is to be an unbiased estimate of the variance of γ̂(a,b,c). Estimates of
variance based on averages, if uncorrected, naturally understate the
variance. The proof that

σ̂ = 0.1867 × 2 = 0.3734

is an unbiased estimate of σ(γ̂,γ̂|3) is in Appendix C.
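The subsample arithmetic above is easy to mechanize. A Python sketch (ours) that drops each observation of (a,b,c) in turn, re-estimates γ, and applies the correction factor S − 1 = 2:

```python
# The realized sample (a, b, c) as (Z, C) points, from Table 2.1.
sample = [(0.0, 4.05), (4.0, 6.00), (14.0, 0.60)]

def gamma_hat(pts):
    """Least squares slope through a list of (Z, C) points."""
    S = len(pts)
    Zs = [z for z, _ in pts]
    Cs = [c for _, c in pts]
    mzz = sum(z * z for z in Zs) / S - (sum(Zs) / S) ** 2
    mzc = sum(z * c for z, c in zip(Zs, Cs)) / S - sum(Zs) * sum(Cs) / S ** 2
    return mzc / mzz

# Leave each observation out in turn: the S subsamples of size S - 1.
subs = [gamma_hat(sample[:i] + sample[i + 1:]) for i in range(len(sample))]
mean = sum(subs) / len(subs)
var = sum((g - mean) ** 2 for g in subs) / len(subs)

# Correct by the factor S - 1 = 2, as in the text.
corrected = var * (len(sample) - 1)
print(subs, round(var, 4), corrected)
```

The three subsample slopes reproduce γ̂(a,b), γ̂(a,c), and γ̂(b,c) of Table 2.3, and the corrected variance matches the 0.3734 of the text (up to the text's rounding).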
In practice we are too lazy to estimate γ again and again for all
the subsamples. The formula σ̂(γ̂,γ̂|S) = m_ûû/[(S − 1)m_zz] gives a
shortcut (and biased) estimate of the variance of γ̂ for samples of the
original size. Table 2.6 lists these estimates for three-point samples
and repeats some of the information from Table 2.4.
Table 2.6
Estimates of γ and of the variance of its estimates

Points in the sample      γ̂            σ̂(γ̂,γ̂) = m_ûû/[(S−1)m_zz]
a  b  c                  −0.30288       0.02605
a  b  c'                  0.73558       0.00256
a  b  c''                 0.73558       0.00256
a  b' c                  −0.28750       0.01377
a  b' c'                  0.75096       0.00895
a  b' c''                 0.75096       0.00895
a' b  c                  −0.29712       0.02731
a' b  c'                  0.74135       0.00217
a' b  c''                 0.74135       0.00217
a' b' c                  −0.28173       0.01377
a' b' c'                  0.75673       0.00822
a' b' c''                 0.75673       0.00822
Table 2.6 must be interpreted carefully. To begin with, the investigator
will usually know only its first line, because he has a single
sample to work with. The remaining lines are put in Table 2.6, for
pedagogic reasons, by the omniscient being who can consider all
possible worlds. Events could have followed one or another course
(and only one) among the courses listed in the several lines of Table 2.6.
It just happened that (a,b,c) materialized and not some other triplet.
It yielded the two estimates γ̂ = −0.30288, a very wrong estimate, and
σ̂ = 0.02605. The latter misleads us to believe in the likelihood of the
former.
If sample (a,b,c') had materialized, the two guesses would have
been γ̂ = 0.73558 (not so bad as before) and σ̂ = 0.00256, which is
ten times as "confident" as before. It is entirely possible for a
sample to give a very wrong parameter estimate with a great deal of
confidence. The mere fact that σ̂(γ̂,γ̂) is small does not make γ̂ a
good guess.
It is comforting, of course, to have some measure of how much γ̂
varies from sample to sample. What is upsetting is that the measure
is itself a guess. True, it is better than nothing, but this is no consolation
if by some quirk of fate we have picked a sample so atypical
that it gives us not only a really wrong parameter estimate γ̂, but also
a really small σ̂(γ̂,γ̂). The moral is: Don't be cocksure about the
excellence of your guess of γ just because you have guessed that its
variance σ(γ̂,γ̂) is small.
2.10. Estimates ad nauseam
Note carefully now that, whereas σ(γ̂,γ̂) is a constant, σ̂(γ̂,γ̂) is not,
but varies with each sample of the given size. Therefore σ̂(γ̂,γ̂) itself
has a variance, which we may denote by σ(σ̂(γ̂,γ̂)); this is a true
constant. Now there is nothing to prevent us from making a guess at
the latter on the basis of our sample, and this guess would be symbolized
by σ̂(σ̂(γ̂,γ̂)), which is no longer a constant but varies with each sample,
and so has a true variance σ(σ̂(σ̂(γ̂,γ̂))) — and so on, ad infinitum. In
other words, we cannot get away from the fact that, if all we can do
about γ is to guess that it equals γ̂, then all we can do about its variance
σ(γ̂,γ̂) is to guess it too; likewise all we can do about this last guess is to
guess again about its true variance, and so on forever. Guess we
must, stage after stage, unless we have some outside knowledge.
Only with outside knowledge can the guessing game stop. The game
is rarely played, however, beyond σ̂(γ̂,γ̂), (1) because it is quite tedious,
and (2) because large enough samples give good γ̂'s and σ̂'s with high
probability.
2.11. The meaning of consistency
As in our explanation of unbiasedness, let us discuss the parameter
γ of the model C_t = α + γZ_t + u_t with the understanding that all
conclusions generalize to all the parameters in the model

y = α + γ1z1 + · · · + γ_H z_H + u
Consider any estimating recipe, say, least squares or least cubes.
Choose a sample of a given size, say, S = 20, and compute γ̂. Then
choose another sample containing one more observation (S = 21)
and compute its γ̂. Keep doing this, always increasing the sample's
size. The bigger samples do not have to include any elements of the
smaller samples — though this becomes inevitable as the big samples
grow, if the universe is finite. 1 If, as the size of the sample grows, the
estimates γ̂ improve, then we say that the least squares procedure is a
consistent estimator of γ. Note that γ̂ does not have to improve in
each and every step of this process of increasing the size of the sample.
Improvement in the above paragraph means that the probability
distributions of γ̂(S), γ̂(S + 1), . . . become more and more pinched
as they straddle the true value of the parameter.
Digression on notation

There are two variant notations for consistency. Let γ̂(s) be
the consistent estimator from a sample of s observations. Let ε
and η be two positive numbers, however small. Then there is
some size S for which

P(|γ̂(s) − γ| < ε) > 1 − η

if s > S. A shorthand notation for the same thing is

P lim γ̂(s) = γ

which is to be read "γ̂(s) converges to γ in probability," or
"the probability limit of γ̂(s) is γ."
Under very weak restrictions, a maximum likelihood estimate is also
a consistent estimate. Note, however, that, even when the method is
consistent, there is no guarantee that the estimate will improve every
time we take a larger sample. It may turn out that our sample of
size 2 happens to contain points a and b, which give an estimate
γ̂(2) = 0.4875, and the next larger sample happens to contain points
¹ A sample could, of course, be infinite without ever including all the elements
of an infinite universe.
a, b, and c, which give an estimate γ̂(3) = −0.30288, which is much
worse. Even when the larger sample includes all the points of the
smaller, as in the example just cited, it can give a worse estimate.
This is so because the next point drawn, c, may be so atypical as to
outweigh the previous typical points a and b.
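The pinching of the sampling distributions can be sketched numerically. Everything numerical below is an illustrative assumption, not taken from the text: the true values α = 2 and γ = 0.5, the uniform range of Z, and the standard normal disturbance are hypothetical choices. The sketch shows the spread of γ̂ shrinking as S grows, which is the sense of "improvement" defined above.

```python
import random

def gamma_hat(zs, cs):
    """Least squares slope from moments: m_zc / m_zz."""
    n = len(zs)
    zbar = sum(zs) / n
    cbar = sum(cs) / n
    m_zc = sum((z - zbar) * (c - cbar) for z, c in zip(zs, cs))
    m_zz = sum((z - zbar) ** 2 for z in zs)
    return m_zc / m_zz

def spread_of_estimates(sample_size, trials=2000, alpha=2.0, gamma=0.5, seed=1):
    """Standard deviation of gamma-hat over repeated samples of a given size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        zs = [rng.uniform(0.0, 10.0) for _ in range(sample_size)]
        cs = [alpha + gamma * z + rng.gauss(0.0, 1.0) for z in zs]
        estimates.append(gamma_hat(zs, cs))
    mean = sum(estimates) / trials
    var = sum((e - mean) ** 2 for e in estimates) / trials
    return var ** 0.5

# The distribution of gamma-hat "pinches" around the true gamma as S grows:
for s in (10, 40, 160):
    print(s, round(spread_of_estimates(s), 4))
```

Note that any one pair of consecutive samples may still move the wrong way, exactly as the text warns; only the distributions shrink.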
2.12. The merits of unbiasedness and consistency
Are the properties of unbiasedness and consistency worth the fuss?
Remember the fundamental fact that with limited sampling resources
it is not possible to estimate γ correctly every time, even when the
estimating procedure is unbiased and consistent.
Because of a small budget, our sample may be so small that γ̂ has a
large variance. Even if the sample is large, it may be an unlucky one,
yielding an extremely wrong estimate. The mistake has happened,
and it is no consolation to know that, if we had taken all possible
samples of that size, we would have hit the correct estimate on the
average. The following complaint is a familiar one from the area of
Uncertainty Economics: Some people advise me to behave always so as
to maximize my expected utility; in other words, to make onceina
lifetime decisions as if I had an eternity to repeat the experiment.
Well, if I get my head chopped off on the first (and necessarily final)
try, what do I care about the theoretical average consequences of my
decision? Wherever a comparatively crucial outcome hinges on a
single correct estimate, unbiasedness is not in itself a desirable property.
Likewise, it is mockery to tell an unsuccessful econometrician that
he could have improved his estimate if he had been willing to enlarge
his sample indefinitely.
What, then, is the use of unbiasedness and consistency? In them
selves they are of no use; they do help, however, in the design of
samples and as rules for research strategy and communication among
investigators.
There is a body of statistical theory — not discussed in this work —
which tells us how to redesign our sample in order to decrease bias and
inconsistency to some tolerable level. For example, with an infinite
universe, if we have two parameters to estimate, the theory says that
a sample must be larger than 100 if consistency is to become "effective
at the 5 per cent level." Whether we want to take a sample that large
depends on the use and strategic importance of our estimate as well as
on the cost of sampling. All this opens up the fields of verification and
statistical decision, into which we shall not go here.
Unbiasedness, consistency, and other estimating criteria to be
introduced below are sometimes conceived of as scientific conventions: 1
If content to look at the procedure of point estimation unpretentiously as a
social undertaking, we may therefore state our criterion of preference for a
method of agreement so conceived in the following terms:
(i) different observers make at different times observations of one and the
same thing by one and the same method;
(ii) individual sets of observations so conceived are independent samples
of possible observations consistent with a framework of competence, and as
such we may tentatively conceptualise the performance of successive
sets as a stochastic process;
(iii) we shall then prefer any method of combining constituents of observations,
if it is such as to ensure a higher probability of agreement
between successive sets, as the size of the sample enlarges in accordance
with the assumption that we should thereby reach the true value of the
unknown quantity in the limit;
(iv) for a given sample size, we shall also prefer a method of combination
which guarantees minimum dispersion of values obtainable by different
observers within the framework of (i) above.
In the long run, the convention last stated guarantees that there will be a
minimum of disagreement between the observations of different observers, if
they all pursue the same rule consistently. . . . We have undertaken to
operate within a fixed framework of repetition. This is an assumption which
is intelligible in the domain of surveying, of astronomy or of experimental
physics. How far it is meaningful in the domain of biology and whether it is
ever meaningful in the domain of the social sciences are questions which we
cannot lightly dismiss by the emotive appeal of the success or usefulness of
statistical methods in the observatory, in the physical laboratory and in the
Cartographer's office.
Philosophers of probability are still debating whether the italics of
the quotation do in fact define a universe of sampling, whether it can
be defined apart from the postulate that an Urn of Nature underlies
everything, and whether the above scientific conventions become
reasonable only upon our conceding the postulate.
¹ Lancelot Hogben, Statistical Theory, pp. 1106207 (London: George Allen &
Unwin, Ltd., 1957). Italics added.
2.13. Other estimating criteria
So far I have mentioned three estimating criteria, or properties that
we might desire our estimating procedures to have. These were
(1) maximum likelihood, (2) unbiasedness, (3) consistency. Some
others are:
4. Efficiency
If γ̂ and γ̃ are two estimators from a sample of S observations, the
more efficient one has the smaller variance. It is possible to have
σ(γ̂,γ̂) < σ(γ̃,γ̃) for some sample sizes and the reverse for other sample
sizes; or one may be uniformly more efficient than the other; some
estimators are most efficient, others uniformly most efficient.
5. Sufficiency
An estimator from a sample of size S is sufficient if no other estimator
from the same sample can add any knowledge about the parameter
being estimated. For instance, to estimate the population mean, the
sample mean is sufficient and the sample median is not.
6. The following desirable property has no name. Let σ(γ̂,γ̂|S)
shrink more rapidly than σ(γ̃,γ̃|S) as the sample increases. Then γ̂
is more desirable than γ̃.
There is no end to the criteria one might invent. Nor are the criteria
listed mutually exclusive. Indeed, a maximum likelihood estimator
tends to the normal distribution as the sample increases; it is consistent
and most efficient for large samples. A maximum likelihood estimator
from a single-peaked, symmetrically distributed universe is unbiased.
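The efficiency criterion (4) can be exhibited on the example used for sufficiency (5): for a normal universe, the sample mean has a smaller sampling variance than the sample median as an estimator of the population mean. The simulation below is a minimal sketch; the standard normal universe, sample size 25, and trial count are arbitrary choices.

```python
import random

def sampling_variance(estimator, sample_size=25, trials=4000, seed=7):
    """Variance of an estimator over repeated normal(0, 1) samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        values.append(estimator(sample))
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_median(xs):
    ys = sorted(xs)
    mid = len(ys) // 2
    return ys[mid] if len(ys) % 2 else 0.5 * (ys[mid - 1] + ys[mid])

v_mean = sampling_variance(sample_mean)      # close to 1/25
v_median = sampling_variance(sample_median)  # larger: roughly (pi/2)/25
print(v_mean, v_median)
```

The mean is uniformly more efficient here; asymptotically the median needs about π/2 ≈ 1.57 times as many observations to match the mean's precision.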
2.14. Least squares and the criteria
If all the Simplifying Assumptions are satisfied, the least squares
method of estimating α and γ in single-equation models of the form

C_t = α + γZ_t + u_t     (2-13)

yields maximum likelihood, unbiased, consistent, efficient, and sufficient
estimates of the parameters. This result can be generalized
in a variety of directions. The first generalization is that it applies to
a model of the form

y(t) = α + γ_1 z_1(t) + γ_2 z_2(t) + · · · + γ_n z_n(t) + u(t)     (2-14)

where y is the endogenous variable and the z's are exogenous variables.
(Least squares is biased if some of the z's are lagged values of y — this
question is postponed to the next chapter.)
Least squares yields maximum likelihood, unbiased, consistent,
sufficient, but inefficient estimates if the variance of u_t is not constant
but varies systematically, either with time or with the magnitude of
the exogenous variables. Such systematic variation of its variance
makes u heteroskedastic.
2.15. Treatment of heteroskedasticity
We shall confine the discussion of heteroskedasticity to model
(2-13) on the understanding that it generalizes to (2-14).
The random term can have a variable variance σ(t) for various
reasons:
1. People learn, and so their errors of behavior become absolutely
smaller with time. In this case σ(t) decreases.
2. Income grows, and people now barely discern dollars whereas
previously they discerned dimes. Here σ(t) grows.
3. As income grows, errors of measurement also grow, because now
the tax returns, etc., from which C and Z are measured no longer
report pennies. Here σ(t) increases.
4. Data-collecting techniques improve. σ(t) decreases.
Consider Fig. 5. It shows a sample of three points coming from a
heteroskedastic universe.
Since the errors are heteroskedastic, we would, on the average,
expect observations in range 1 to fall rather near the true regression
line, observations in range 2 somewhat farther, and in range 3 farther
still. In any given sample, say, (a,b,c), points b and c should ideally
be "discounted" according to the greater variances that prevail in
their ranges. Using the straight sum of squares is the same as failing
to discount b and c. The result is that sample (a,b,c) gives a larger
value for γ̂ than it would if observations had been properly discounted.
If no allowance is made for the changing variance σ(t), least squares
fits are maximum likelihood, unbiased, and consistent but inefficient.
To show inefficiency, consider the likelihood function of (2-4). There,
the matrix of the covariances of the random term not only was
diagonal but had equal entries; so the common variance could factor
out and drop out when the likelihood function was maximized with
respect to γ (and α). It is this fact that made γ̂ an efficient estimate.
With unequal entries along the diagonal, this is no longer possible. To
obtain an efficient, unbiased, and consistent estimate of γ, we must
solve a complicated set of equations involving γ, σ(1), . . . , σ(S).
Somewhat less efficient (but more so than minimizing Σu²) is to make
(from outside knowledge) approximate guesses about σ(1), . . . , σ(S)
and to minimize the sum of squares of appropriately "deflated"
residuals (see Exercise 2.G). This, too, is an unbiased and consistent
estimate.

Fig. 5. A typical sample from a heteroskedastic universe.
Exercises
2.F Prove that γ̂ = m_zc/m_zz is unbiased and consistent even
when u is heteroskedastic.
2.G Let φ(s) be an estimate (from outside information) of 1/σ_uu(s).
Prove that minimizing Σφ(s)u²(s) yields the following estimate of γ:

γ̂(φ) = [(Σφ)(ΣφCZ) − (ΣφZ)(ΣφC)] / [(Σφ)(ΣφZ²) − (ΣφZ)²]

2.H Prove the unbiasedness and consistency of γ̂(φ).
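The weighted estimate of Exercise 2.G can be written out directly from its formula. In the sketch below the data and weights are made up for illustration; on data lying exactly on a line, any positive weights recover the same slope, while on real heteroskedastic data the weights φ(s) = 1/σ_uu(s) discount the high-variance observations.

```python
def gamma_hat_phi(phi, Z, C):
    """Weighted least squares slope from Exercise 2.G:
    [(S_phi)(S_phiZC) - (S_phiZ)(S_phiC)] / [(S_phi)(S_phiZ^2) - (S_phiZ)^2]
    where S_ denotes a sum over the observations."""
    s_p   = sum(phi)
    s_pz  = sum(p * z for p, z in zip(phi, Z))
    s_pc  = sum(p * c for p, c in zip(phi, C))
    s_pzc = sum(p * z * c for p, z, c in zip(phi, Z, C))
    s_pzz = sum(p * z * z for p, z in zip(phi, Z))
    return (s_p * s_pzc - s_pz * s_pc) / (s_p * s_pzz - s_pz ** 2)

# Hypothetical data lying exactly on C = 2 + 0.5 Z:
Z = [1.0, 2.0, 3.0, 4.0]
C = [2.5, 3.0, 3.5, 4.0]
print(gamma_hat_phi([1.0, 1.0, 1.0, 1.0], Z, C))   # -> 0.5 (equal weights = OLS)
print(gamma_hat_phi([4.0, 2.0, 1.0, 0.5], Z, C))   # -> 0.5 (exact data: any weights)
```

With equal weights the formula collapses to ordinary least squares, which is one way to check an implementation.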
50 ESTIMATING CRITERIA AND THE METHOD OP LEAST SQUARES
Digression on arbitrary weights
The weights φ(s) are arbitrary. Is there no danger that the
denominator of γ̂(φ) might be (nearly or exactly) zero and blow
up the proceedings?
Answer: There is none.
Proof

(Σφ)(ΣφZ²) − (ΣφZ)² = ½ Σ_i Σ_j φ_i φ_j (Z_i − Z_j)²

which is strictly positive so long as the weights are positive and the Z's
are not all equal.
It is perfectly proper to deflate the heteroskedastic residuals by the
exogenous variable Z itself and to fit by least squares the homoskedastic
equation

C/Z = α(1/Z) + γ + u/Z     (2-15)

instead of the original heteroskedastic one

C = α + γZ + u     (2-16)

From (2-15) and (2-16) we obtain numerically different consistent
and unbiased estimates of α and γ.
Exercise
2.I Prove that α̂(Z) = m_(C/Z)(1/Z) / m_(1/Z)(1/Z) is unbiased and
consistent.
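A short numerical sketch of the deflation in (2-15): dividing through by Z and regressing C/Z on 1/Z swaps the roles of the two parameters, so the fitted slope estimates α and the fitted intercept estimates γ. The data below are hypothetical and lie exactly on C = 2 + 0.5Z, so both parameters are recovered.

```python
def ls_slope_intercept(x, y):
    """Ordinary least squares slope and intercept of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
            sum((a - xbar) ** 2 for a in x)
    return slope, ybar - slope * xbar

# Deflate: C = alpha + gamma*Z + u  becomes  C/Z = alpha*(1/Z) + gamma + u/Z.
# The slope on 1/Z now estimates alpha; the intercept estimates gamma.
Z = [1.0, 2.0, 4.0, 5.0]
C = [2.5, 3.0, 4.0, 4.5]                  # exactly C = 2 + 0.5 Z
inv_Z = [1.0 / z for z in Z]
ratio = [c / z for c, z in zip(C, Z)]
alpha_hat, gamma_hat = ls_slope_intercept(inv_Z, ratio)
print(alpha_hat, gamma_hat)               # recovers alpha = 2, gamma = 0.5 (up to rounding)
```

On heteroskedastic data the two fits (2-15) and (2-16) give numerically different but both consistent estimates, as the text states.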
Further readings
Maurice G. Kendall, "On the Method of Maximum Likelihood" (Journal
of the Royal Statistical Society, vol. 103, pt. 3, pp. 389-399, 1940) discusses
the reasonableness of the method and the concept of likelihood. Whether
the principle of maximum likelihood is logically wound up with subjective
belief or inverse probability is still under debate. The intrepid reader who
leafs through the last 30 or so years of the above Journal will be rewarded
with the spectacle of a battle of nimble giants: Bartlett, Fisher, Gini, Jeffreys,
Kendall, Keynes, Pearson, Yule.
The algebra of moments is a special application of matrices and vectors.
Matrices and determinants are explained in the appendixes of Klein and
Tintner. Allen devotes two chapters (12 and 13) to all the vector, matrix,
and determinant theory an economist is ever likely to need.
The estimating criteria of unbiasedness, consistency, etc., are clearly stated
and briefly discussed in the first dozen pages of Kendall's second volume,
and debunked by Hogben in the reference cited in the text.
The reason for using m_uu/(S m_zz) as an estimate of σ(γ̂,γ̂), the formula for
estimating cov(α̂,γ̂) and σ(α̂,α̂), and the extensions of these formulas for
several y variables are stated and rationalized (in my opinion, not too
convincingly) by Klein, pp. 133-137.
CHAPTER 3
Bias in models of decay
3.1. Introduction and summary
This chapter is tedious and not crucial; it can be skipped without
great loss. I wrote it for two reasons: to develop the concept of
conjugate samples, and to show what I have claimed in the Preface:
that commonsense interpretations of intricate theorems in mathematical
statistics can be found.
The main proposition of this chapter is that a single-equation model
of the form

y_t = y(t) = α + γ_1 z_1(t) + · · · + γ_H z_H(t) + u(t)     (3-1)

in which some of the z's are not exogenous variables but rather lagged
values of y itself, necessarily violates Simplifying Assumption 6, and
hence that maximum likelihood estimates of α, γ_1, . . . , γ_H are
biased.
The concept of conjugate samples gives a handy and simple-minded
but entirely rigorous way to test for bias. It will be used again and
again in later chapters for models much more complicated than (3-1).
Equations involving lags of an endogenous variable are called
autoregressive.
Most satisfactory dynamic econometric models are multivariate
autoregressive systems, in other words, elaborate versions of (3-1),
and share its pitfalls in estimation. We shall see that the character of
initial conditions affects vitally our estimating procedure and that,
unfortunately, in econometrics the initial conditions are not favorable
to estimation, though in the experimental sciences they commonly are.
If the initial condition y(0) is a fixed constant Y, the maximum likelihood
criterion leads to least squares regression of y(t) on y(t − 1),
and the resulting estimate for γ is biased, except for samples of size 1.
If y(0) is a random variable, independent of u, then the maximum
likelihood criterion does not lead to least squares. If least squares are
used in this instance, they lead to biased estimates, again with the
exception of samples of size 1.
CONVENTION
The size S of the sample is given in units that correspond to the
number of points through which a line is fitted. Thus, if we observe
only y_3 and y_2, this is a sample of one; S = 1. If we observe
y_4, y_3, and y_2, this makes a sample of two points, S = 2, and so on. In
his proof of this theorem, Hurwicz (in Koopmans, chap. 15) would call
these, respectively, samples of size T = 2 and T = 3. The difference
is important when observations have gaps (are not consecutive). We
shall confine ourselves to consecutive samples. Appendix D deals
with the general case.
3.2. Violation of Simplifying Assumption 6
A lagged variable, unlike an exogenous variable, cannot be independent
of the random component of the model. In (3-1) a lagged
value of y is necessarily correlated with some past value of u, because
y(t) and u(t) are clearly correlated. Therefore, the very specification
of (3-1) rules out Simplifying Assumption 6.
But why worry about such models? Because (3-1) and its generalizations
express in linear form oscillations, decay, and explosions, which
are all of great interest and which are, indeed, the bread and butter of
physics, astronomy, and economics. For instance, springs behave
substantially like

y(t) = α + γ_1 y(t − 1) + γ_2 y(t − 2) + u(t)
and radioactive decay and pendulums like

y(t) = γy(t − 1) + u(t)     (3-2)
Business cycles are more complicated, involving several equations
like (3-1).
Why do we want unbiased estimates? There are excellent reasons.
If the world responds to our actions with some delay or if we respond
with delay to the world, in order to act correctly we need to know the
parameters accurately. How hot the water in the shower is now
depends on how far I had turned the tap some seconds ago. If my
estimate of the parameter expressing the response of water temperature
to a turn of the tap is biased, this means that I freeze or get scalded
or that I alternate between these two states, and, in any event, that
I reach a comfortable temperature much later than I would with an
unbiased estimate.
In economics, consumers, businesses, and governments act like a
man in a shower. The information they get about prices, sales,
orders, or national income comes with some delay and reflects the
water temperature at the tap some time ago. Moreover, it takes time
to decide and to put decisions into effect. If the decision makers
have misjudged how strong are the natural damping properties of the
economy, decisions and policy will either overshoot or undershoot the
mark, or alternate between overshooting and undershooting it, and
will cause uncomfortable and unnecessary oscillations in economic
activity.
Our discussion will now be confined to the simplest possible case
(3-2). Let consumption this year y(t) depend on consumption last
year y(t − 1), as in (3-2). If the relationship involved a constant
term α, we eliminate α by measuring y not from the origin but from its
equilibrium value. I shall illustrate my argument by a concrete
example where the true γ has the convenient value 0.5 and where the
initial value Y is fixed and equal to 24.
In Fig. 6, line OP represents the exact relationship y_t = 0.5y_{t−1}.
3.3. Conjugate samples
In model (3-1) with fixed initial conditions, we can describe a sample
completely by mentioning two things: (1) what time periods it includes
and (2) what values the disturbances took on in those periods. For
example, (a,b,c,d) in Fig. 6 is completely described by

[s = 1, 2, 3, 4]
[u_s = +4, 0, 0, 0]

(a′,b′,c′,d′) is described by

[s = 1, 2, 3, 4]
[u_s = −4, 0, 0, 0]

and (a′,b′,d′,e) by

[s = 1, 2, 4, 5]
[u_s = −4, 0, 0, 0]
If u is symmetrically distributed, all conceivable samples of size S
that one can draw from the universe can be arranged in conjugate sets.
We shall see that in each conjugate set the maximum likelihood estimates
Fig. 6. Conjugate disturbances.
of γ average to less than the true value of γ and, therefore, that
maximum likelihood estimates are biased for all samples of size S.
These propositions need to be qualified if S = 1 or if γ is not between
0 and 1; they are proved if u(t) is normally distributed, but only
conjectured if u(t) has some other symmetrical distribution.
For an introduction to the concept of conjugate samples, consider
Fig. 6, which depicts two of the many possible courses that events can
follow under our assumptions that γ = 0.5 and Y = 24. One course
is represented by the points a, b, c, d, e, . . . ; the other by a′, b′, c′,
d′, . . . . In the first course, the disturbance is equal to +4 in period 1
and zero thereafter. In the second course, it is −4 in period 1 and zero
thereafter. The samples S(+) = (a,b,c,d) and S(−) = (a′,b′,c′,d′)
are conjugate samples, and form a conjugate set. Similarly, (a,b,c) and
(a′,b′,c′) form a conjugate set.
To be conjugate, two samples must be drawn from the same time
span s = 1, 2, . . . , S; and the disturbances u_s that contributed to
corresponding observations must have the same absolute value in the
two samples. This definition is for consecutive samples only. Appendix
D extends it to the nonconsecutive case.
Thus, sample

[s = 3, 4, 5, 6]
[u_s = 0, 0, 0, 0]

forms a conjugate set all by itself. Sample

[s = 3, 4, 5, 6]
[u_s = 0, 0, 17, 0]

has as its conjugate

[s = 3, 4, 5, 6]
[u_s = 0, 0, −17, 0]

Sample

[s = 4, 5, 6, 7]
[u_s = 0, 1, 0, 9]

has three conjugates, the following:

[s = 4, 5, 6, 7]      [s = 4, 5, 6, 7]      [s = 4, 5, 6, 7]
[u_s = 0, −1, 0, 9]   [u_s = 0, 1, 0, −9]   [u_s = 0, −1, 0, −9]
The greatest conjugate set of samples of size S has 2^k members, where
k (0 ≤ k ≤ S) represents the number of nonzero disturbances. If S = 4,
the largest conjugate set contains 16 samples.
3.4. Source of bias
In Fig. 6, line OR with slope γ̂[S(+)] = 0.6053 is the least squares
regression through the origin fitted to sample S(+) = (a,b,c,d); and OR′
with slope γ̂[S(−)] = 0.3545 is the same for the conjugate (a′,b′,c′,d′).
The line OR overestimates γ because OR is pulled up by point a. The
line OR′ underestimates γ because of the downward pull of point a′.
As we have ½(0.6053 + 0.3545) = 0.4799 < γ, the downward pull
is the stronger. But why? Because point a is accompanied by b, c, d,
and a′ by b′, c′, d′. The primed points b′, c′, d′ are closer to the origin
than the corresponding unprimed points; hence, their "leverage" on
their least squares line OR′ is weaker than the leverage of the unprimed
points on theirs (line OR). It is impossible for a′ to be accompanied
by b, c, d, because all future periods must necessarily inherit whatever
impulse was first imparted by the random term of period 1. Points b′,
c′, d′ inherit a negative impulse, and points b, c, d inherit a positive
one.
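The two slopes quoted above can be checked by direct computation from the example's assumptions (Y = 24, γ = 0.5, a period-1 disturbance of ±4, zeros thereafter), using the least squares formula through the origin:

```python
def trajectory(Y, gamma, shocks):
    """Generate y(0) = Y, then y(t) = gamma*y(t-1) + u(t) for the given shocks."""
    ys = [Y]
    for u in shocks:
        ys.append(gamma * ys[-1] + u)
    return ys

def gamma_hat(ys):
    """Least squares through the origin: sum y_t*y_(t-1) / sum y_(t-1)^2."""
    num = sum(ys[t] * ys[t - 1] for t in range(1, len(ys)))
    den = sum(ys[t - 1] ** 2 for t in range(1, len(ys)))
    return num / den

plus  = trajectory(24.0, 0.5, [+4.0, 0.0, 0.0, 0.0])   # sample S(+) = (a,b,c,d)
minus = trajectory(24.0, 0.5, [-4.0, 0.0, 0.0, 0.0])   # conjugate S(-) = (a',b',c',d')
print(round(gamma_hat(plus), 4))                        # 0.6053
print(round(gamma_hat(minus), 4))                       # 0.3545
print(round((gamma_hat(plus) + gamma_hat(minus)) / 2, 4))  # 0.4799 < 0.5
```

The unprimed course is y = 24, 16, 8, 4, 2 and the primed one y = 24, 8, 4, 2, 1; the two slopes are 552/912 and 234/660, whose average falls short of the true γ = 0.5, exactly as the text asserts.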
Another way of stating this is by referring to (3-1). In (3-1) one
of the z's (say, z_i) is a lagged value of y (say, the lag is 2 time periods).
It follows that z_i(t) is correlated with the past value of the disturbance
u(t − 2), since y(t) is clearly correlated with u(t).
All the proofs of bias later in this chapter and in Appendix D are
merely fancy versions of what I have just shown for this special case.
When conjugate sets are large, arguments from the geometry of Fig. 6,
though perfectly possible, become confusing, and so we turn to
algebra.
With fixed initial condition y(0) = Y, the maximum likelihood estimate
of γ is the least squares estimate

γ̂ = Σ_{t=1}^{S} y_t y_{t−1} / Σ_{t=1}^{S} y_{t−1}²     (3-3)
3.5. Extent of the bias
From (3-3) and (3-2),

γ̂ = γ + (Σ u_t y_{t−1}) / (Σ y_{t−1}²)     (3-4)

We write the above fraction N/D. We shall see that the bias N/D
varies with the true value of γ, the size of the sample, and the size
of the initial value Y. For instance, in small samples it is almost
25 per cent; in samples of 20 observations, it is about 10 per cent of
the true value of γ. It never disappears, no matter what value true γ
may have or how large a sample one takes.
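The downward bias can also be sketched by Monte Carlo: draw many samples from (3-2), compute γ̂ by (3-3) for each, and average. The disturbance law (normal with σ = 4) and the sample sizes below are illustrative assumptions; the true values γ = 0.5 and Y = 24 are those of the text's example. The point of the sketch is only that the average of γ̂ falls below the true γ.

```python
import random

def ls_gamma(ys):
    """Eq. (3-3): least squares of y(t) on y(t-1) through the origin."""
    num = sum(ys[t] * ys[t - 1] for t in range(1, len(ys)))
    den = sum(ys[t - 1] ** 2 for t in range(1, len(ys)))
    return num / den

def mean_estimate(S, gamma=0.5, Y=24.0, sigma=4.0, trials=20000, seed=3):
    """Average gamma-hat over many samples of size S with fixed y(0) = Y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        ys = [Y]
        for _ in range(S):
            ys.append(gamma * ys[-1] + rng.gauss(0.0, sigma))
        total += ls_gamma(ys)
    return total / trials

for S in (4, 20):
    print(S, round(mean_estimate(S), 3))   # both averages fall short of gamma = 0.5
```

The size of the shortfall depends on Y, σ, and S, in keeping with the discussion of N and D below; only its negative sign is being illustrated here.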
By applying (3-2) repeatedly and letting P, Q, and R stand for
polynomials, we get

N = (u_1 + γu_2 + γ²u_3 + · · · + γ^{S−1}u_S)Y + P(u_1, . . . , u_S)

D = (1 + γ² + γ⁴ + · · · + γ^{2(S−1)})Y² + YQ(γ, u_1, . . . , u_{S−1}) + R(u_1, . . . , u_{S−1})
By considering N/D, one can establish that the bias is aggravated the
further γ is from +1 or −1 and the smaller the sample. Bias exists
even when γ = ±1 or when γ = 0; the latter is truly remarkable,
since the model is then reduced to y(t) = u(t). Since N is a linear
function of Y and the always positive denominator D is a quadratic
function of Y, the bias N/D can be quite large for certain ranges of Y.
The above results generalize to model (31), although it is not easy
to say whether the bias is up or down.
3.6. The nature of initial conditions
The following fantastically artificial example illustrates the concept
of conjugate samples and what it means for initial conditions y(0) to
be random or fixed.
An outfit that runs automatic cafeterias has its customers use,
instead of coins, special tokens made of copper. The company has
several cafeterias across the country, but its customers rarely think
of taking their leftover tokens with them when they travel or move
from city to city. As there is at most one cafeteria per city, each
cafeteria's tokens are like independent, closed monetary systems.
Let us look at a single cafeteria of this kind.
Originally it had coined a number of brand-new tokens and put
them in circulation, using y(0) pounds of copper. Thereafter, the
amount of copper in the tokens is subject to two influences. (1) To
begin with, the tokens wear out as they are used. The velocity of
token circulation is equal in all cities, and customers' pockets, hands,
keys, and other objects that rub against the tokens are equally abrasive
in all cities. Thus, in each city, year t inherits only a part γ
(0 < γ < 1) of the copper circulating in the previous year. (2) In
addition to the systematic factor of wear and tear, random influences
are at play. First, some customer's child now and then swallows a
token; this disappears utterly from circulation into the city's sewers.
However, occasionally there is an opposite tendency. An amateur
but successful counterfeiter mints his own token now and then, or a
lost token is found inside a fish and put back into circulation. So
the copper remaining in circulation is described by the stochastic
model (3-2). The problem for the company is how to estimate the
true survival rate of its tokens.
It is very important to interpret correctly our first assumption that
"u(t) is a random variable in each time period t." It means that u(t)
is capable of assuming at least two values (opposites, if u is symmetrical)
in the same period of time. But how can it? Here we
need a concept of conjugate cities analogous to conjugate samples.
Imagine that the only positive disturbances come from one counterfeiter
and that the only negative disturbances come from one child,
the counterfeiter's child, who swallows tokens. The counterfeiter is
divorced, the child was awarded to the mother, and the two parents
always live in separate cities, say, Ames and Buffalo; but who lives
where in year t is decided at random. Ames and Buffalo are conjugate
cities, because, when one experiences counterfeiting, +u(t), the other
necessarily experiences swallowing, −u(t). If there were more families
like this one, the set of conjugate cities would have to expand enough
to accommodate all permutations of the various values that ±u(t)
is capable of assuming.
We have fixed initial conditions if each cafeteria starts with the
same poundage, and random initial conditions when the initial pound
age is a random variable. To estimate the token survival rate,
different procedures should be used in the two cases.
3.7. Unbiased estimation
Unbiased estimation of γ is possible only if the initial copper endowment
is a fixed constant Y. The only unbiased estimate is given by
the ratio of the first two successive y's using data from a single city:

γ̂ = y(1)/y(0) = y(1)/Y     (3-5)

which is a degenerate least squares estimate.
This result is really startling. It says that we must throw out any
information we may have about copper supply anywhere, except in
year 0 and year 1 in, say, Ames. Unless we do this we can never
hope to get an unbiased estimate. Estimating γ without bias when
each city starts with a different amount of copper is an impossible
task. A complete census of copper in all cities (i) in two successive
years would give the correct (not just unbiased) estimate

γ = Σ_i y_i(t) / Σ_i y_i(t − 1)
We can draw another fascinating conclusion: If we have the bad
luck to start off with different endowments, we can never get an
unbiased estimate of γ. But suppose we find that the endowments
of all cities happen to be equal later, say, in period t − 1. Then all
we have to do is wait for the next year, measure the copper of any
one city, say, Buffalo, and compute the ratio

γ̂ = y_t / y_{t−1}     (3-6)
which is an unbiased estimate. (Where the information would come
from, that all cities have an equal token supply in year t − 1, is
another matter.)
The experimental scientist is, however, free from such predicaments.
If he thinks radium decays as in (3-2), then he can make initial conditions
equal by putting aside in several boxes equal lumps of radium.
Then he can let them decay for a year, remeasure them, apply (3-6)
to the contents of any one box, and average the results. Any one box
gives an unbiased estimate. Averaging the contents of several boxes
gives an estimate that is efficient as well as unbiased.
The econometrician cannot control his initial conditions in this way.
If he wants an unbiased estimate, he must throw away, as prescribed,
most of his information, use formula (3-6), and thus get an unbiased
and inefficient estimate. Or else he may decide that he wants to
reduce the variance of the estimate at the cost of introducing some
bias; then he will use a formula like that of Exercise 3.C below or
some more complicated version of it.
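That the degenerate estimate (3-5) is unbiased is easy to see: with y(0) = Y fixed, y(1)/Y = γ + u(1)/Y, whose expectation is exactly γ. A small simulation sketch confirms this; the true γ = 0.5, Y = 24, and a normal disturbance with σ = 4 are the illustrative values used in this chapter's example, not part of the proof.

```python
import random

def first_ratio_estimate(gamma, Y, sigma, rng):
    """The degenerate least squares estimate (3-5): y(1)/y(0), with y(0) = Y fixed."""
    y1 = gamma * Y + rng.gauss(0.0, sigma)
    return y1 / Y

rng = random.Random(11)
trials = 100000
avg = sum(first_ratio_estimate(0.5, 24.0, 4.0, rng) for _ in range(trials)) / trials
print(avg)   # close to the true gamma = 0.5
```

By contrast, a ratio such as y(2)/y(1) of Exercise 3.B divides by a random denominator, which is what destroys unbiasedness.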
Autoregressive equations are related to the moving average, a tech
nique commonly employed to interpolate data, to estimate trends,
and to isolate cyclical components of time series. The statistical
pitfalls of estimating (32) plague time series analysis, and they are
not the only pitfalls. The last chapter of this book returns to some
of these problems.
Exercises
3.A Prove that (3-5) is unbiased.
3.B Prove that γ̂ = y(2)/y(1) is biased.
3.C Prove that γ̂ = [y(2) + y(1)]/[y(1) + Y] is biased.
3.D Let u_t in (3-2) have the symmetrical distribution q(u_t) with
finite variance. "Symmetrical" means q(u_t) = q(−u_t). Then the
likelihood function of a random consecutive sample is Π_t q(u_t).
Prove that the maximum likelihood estimate of γ is obtained by
maximizing the expression Σ_s log q(û_s), where the û_s are the vertical
deviations from the line that we are seeking.
3.E By the method of conjugate samples or by any other method,
prove or disprove the conjecture that the estimate of Exercise 3.D is
biased.
Further readings
The reader who wants to see for himself how intricate is the statistical
theory of even the simplest possible lagged model (3-2) may look up "Least
Squares Bias in Time Series," by Leonid Hurwicz, chap. 15 of Koopmans,
pp. 365-383. Tintner, pp. 255-260, gives examples and shows additional
complications.
CHAPTER 4
Pitfalls of simultaneous
interdependence
4.1. Simultaneous interdependence
"Everything depends on everything else" is the theme song of the
Economic and the Celestial Spheres. It means that several contemporaneous
endogenous variables hang from one another by means
of several distinct causal strings. Thus, there are two causal (more
politely, functional) relations between aggregate consumption and
aggregate income: Since people are one another's customers, consumption
causes income, and, since people work to eat, income causes
consumption. The two relationships are, respectively, the national
income identity in its simplest form

y_t = c_t     (4-1)

and the (unlagged) stochastic consumption function in its simplest
form

c_t = α + βy_t + u_t     (4-2)

We can imagine that causal forces flow from the right to the left of
the two equality signs.
The moral of this chapter is that, if endogenous variables, like c
and y, are connected in several ways, like (4-1) and (4-2), every
statistical procedure that ignores even one of the ways is bound to be
wrong. The statistical procedure must reflect the economic inter
dependence.
4.2. Exogenous variables
I shall not vouch for the Heavens, but in economics there are such
things as exogenous variables. A variable exogenous to the economic
sphere is a variable, like an earthquake, that influences some economic
variables, like rents and food prices, without being influenced back.
The random term u is, ideally, exogenous, though in practice it is a
catchall for all unknown or unspecified influences, exogenous or
endogenous. One thing is certain: Earthquakes and such are not influenced
by disturbances in consumption. Indeed, the definition of an exogenous
variable is that it has no connection with the random component
of an economic relationship.

My prototype exogenous variable, investment z, is not really
exogenous to the economic system, especially in the long run, but we
shall bow to tradition and convenience for the sake of exposition.
4.3. Haavelmo's proposition
The models in this chapter, like the single-equation models treated
so far, (1) are linear and (2) have all the Simplifying Properties.
Therefore, they are subject to all the pitfalls I have pointed out so
far. Unlike the models of Chaps. 1 to 3, the new models each contain
at least two equations. Most of my examples will have precisely
two (and not three or four) for convenience only, since the results
can easily be extended.
New kinds of complication arise when a second equation is added.
1. The identification problem
It is sometimes impossible to estimate the parameters — this problem
is sidestepped until Chap. 6.
2. The Haavelmo¹ problem
The intuitively obvious way of estimating the parameters of a
two-equation model is wrong, even in the simplest of cases, where one
of the equations is an identity. We shall see that pedestrian methods
are unable to estimate correctly the marginal propensity to consume
out of current income, no matter how many years of income and
consumption data we may have. Even infinite samples overestimate
the marginal propensity to consume. This difficulty is as strategic
as it sounds incredible. It means that the multiplier gets overestimated
and, hence, that counterdepression policies will undershoot full
employment and counterinflation policies will be too timid. Because
of bad statistical procedures, the cure of unemployment or inflation
comes too slowly.
The model is as follows:

c_t = α + βy_t + u_t   (consumption function)   (4-2)
c_t + z_t = y_t   (income identity)   (4-3)

where z_t (investment) is exogenous, and u_t has all the Simplifying
Properties.² We shall illustrate by assuming the convenient values
α = 5, β = 0.5.
In Fig. 7, line FG represents the true relation c_t = 5 + 0.5y_t. When
the random disturbance is positive, the line moves up; with a negative
disturbance, it moves down. Lines HJ and KL correspond, respectively,
to random errors equal to +2 and -2. OQ, the 45° line through the
origin, represents equation (4-3) for the special case in which
investment z is zero. In the years when investment is zero, the only
combinations of income and consumption we could possibly observe will
have to lie on OQ, because nowhere else can there be equilibrium.
If, for instance, in years 1900 and 1917 investment had
¹ For reference to Haavelmo, see Further Readings at the end of this chapter.
² To be specific, Assumption 6 in this case requires that u and z shall not influence
each other, either in the same time period or with a lag. But the random term u
cannot be independent of y. The reason is that α and β are constants, z is fixed
outside the economic sphere, and u comes, so to speak, from a table of random
numbers; if this is so, then, by equations (4-2) and (4-3), α, β, z, and u necessarily
determine y (and c). Thus variable y is not predetermined but codetermined
with c. These statements summarize and anticipate the remainder of the chapter.
been zero and if the errors had been +2 and —2, respectively, then
points P and P' would have been observed.
Let us now suppose that in some years investment z_t equals 3.
Line MN (also 45° steep) describes the situation, which is that
c_t + 3 = y_t. With errors u_t = ±2, the observable points are at R
and R'. With errors ranging from -2 to +2, all observable points
fall between R and R'.
Fig. 7. The Haavelmo bias.
Let us now pass a least-squares regression line through a scatter
diagram of income and consumption, minimizing squares in the vertical
sense and arguing that, from the point of view of the consumption function,
income causes consumption, not vice versa. Such a procedure is
bound to overestimate the slope β of the consumption function and
to underestimate its intercept α. This is Haavelmo's proposition.
The least squares line (in dashes) corresponds to observation points
that lie in the area PP'R'R. It is tilted counterclockwise relative to
the true line FG because of the pull of "extreme" points in the corners
next to R and P'. The less investment z ranges and the bigger the
stochastic errors u are, the stronger is the counterclockwise pull,
because lines PP' and RR' fall closer together.
This overestimating of β persists even if we allow investment to
range very far. Though it is true that the parallelogram PP'R'R gets
longer and longer toward the northeast, the fact remains that R and P',
the extreme corners, help to tilt the least squares line upward. This
suggests that perhaps we ought to minimize squares not in a vertical
direction but in a direction running from southwest to northeast.
In this particular case (though not generally) diagonal least squares
are precisely correct and equivalent to the procedure of simultaneous
estimation described in the following section.
4.4. Simultaneous estimation
We know that two relations, not one, account for the slanted position
of the universe points in Fig. 7. Had the consumption function
been at work alone, a given income change Δy would result in a change
in consumption Δc = βΔy. Had the income identity been at work
alone, then to the same change in income would correspond a larger
change in consumption Δc = Δy. In fact, both relations are at work.
Therefore, the total manifest response of consumption to income is
neither Δc = βΔy nor Δc = Δy, but something in between. This is
why the line in dashes is steeper than FG (and less steep than OQ).

In order to isolate the β effect from a sample of points like PP'R'R,
both relations must be allowed for. This is done by rewriting the
model:

c_t = α/(1 - β) + [β/(1 - β)]z_t + u_t/(1 - β)   (4-4)
y_t = α/(1 - β) + [1/(1 - β)]z_t + u_t/(1 - β)   (4-5)
The term u_t/(1 - β) has the same properties as u_t except that it
has a different variance. Therefore the error term in the new model
has all the Simplifying Properties. Either of the new equations
constitutes a single-equation model with one endogenous variable (c and y,
respectively) and one independent variable (z in both cases). Therefore,
the estimating techniques of Sec. 2.5 can be applied to the sophisticated
parameters α' = α/(1 - β), γ_1 = β/(1 - β), γ_2 = 1/(1 - β).
Denote these estimates by the hat (ˆ). For the naive least squares
estimate of α and β, derived from regressing c on y, use the bird (ˇ).
Let us now express these estimates in terms of moments, and let us do
it for β, γ_1, and γ_2 only, leaving aside α and α'.
β̌ = m_cy/m_yy    γ̂_1 = m_cz/m_zz    γ̂_2 = m_yz/m_zz

β̌ is a biased estimate of β, because

β̌ = m_cy/m_yy = (βm_yy + m_uy)/m_yy = β + m_uy/m_yy

and it is known that

E(m_uy/m_yy) ≠ 0   (4-6)

β̌ is inconsistent, because

β̌ = Σ(βz + u)(z + u)/Σ(z + u)(z + u) = [β + (1 + β)(m_uz/m_zz) + m_uu/m_zz]/[1 + 2m_uz/m_zz + m_uu/m_zz]

The various moments m_cy, m_uu, etc., vary in value, of course, from
sample to sample. As the sample size approaches the population
size, however, m_uz approaches cov (u,z) = 0, m_uu approaches var u > 0,
and m_zz approaches var z > 0. Therefore,

Plim β̌ = (β + var u/var z)/(1 + var u/var z)
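Haavelmo's proposition lends itself to a quick numerical check. The sketch below is an illustrative simulation, not anything from the text: the sample size, the uniform distribution of z, and the normal disturbance u are all invented, and only α = 5, β = 0.5 follow the chapter's running example. It generates data from (4-2) and (4-3), then compares the naive regression of c on y with the estimate m_cz/m_yz based on the reduced forms (4-4) and (4-5), and with the Plim formula just derived.

```python
import random

random.seed(0)

ALPHA, BETA = 5.0, 0.5   # true parameters of c = ALPHA + BETA*y + u
N = 20000                # large sample, so sampling noise is small

# Exogenous investment z and disturbance u, independent of each other.
z = [random.uniform(0.0, 4.0) for _ in range(N)]
u = [random.gauss(0.0, 2.0) for _ in range(N)]

# The model determines y and c simultaneously: y = (ALPHA + z + u)/(1 - BETA).
y = [(ALPHA + zi + ui) / (1.0 - BETA) for zi, ui in zip(z, u)]
c = [yi - zi for yi, zi in zip(y, z)]

def moment(a, b):
    """Sample moment m_ab about the means."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

beta_naive = moment(c, y) / moment(y, y)   # the "bird": regress c on y
beta_simul = moment(c, z) / moment(y, z)   # from the reduced forms: m_cz/m_yz

# Haavelmo's plim for the naive estimate:
ratio = moment(u, u) / moment(z, z)        # stands in for var u / var z
plim_naive = (BETA + ratio) / (1.0 + ratio)

print(round(beta_naive, 2), round(plim_naive, 2), round(beta_simul, 2))
```

With these settings var u/var z is about 3, so the plim formula says the naive estimate settles near (0.5 + 3)/(1 + 3) ≈ 0.88 instead of 0.5, while m_cz/m_yz stays close to the true β.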
Exercises

4.A In similar fashion prove that Plim α̌ < α.
4.B Interpret (4-6).
4.C Show that 1/(1 - β̌) is a biased estimate of 1/(1 - β).
Hint: manipulate the expression

1/(1 - β̌) = 1/(1 - m_cy/m_yy)

and use the fact that E(m_yz/m_zz) = 1/(1 - β).
4.D Prove that γ̂_1 is an unbiased and consistent estimate of
β/(1 - β).
4.E Prove that γ̂_2 is an unbiased and consistent estimate of
1/(1 - β).
4.F Prove that γ̂_1 and γ̂_2 yield a single compatible estimate of β,
which we call β̂; β̂ = m_cz/m_yz.
4.G Prove that β̂ is a biased but consistent estimate of β.
4.H From the facts that β̌ = β + m_uy/m_yy and that β̂ = β + m_uz/m_yz,
argue that the bias of β̂ is less serious than the bias of β̌.
Digression on directional least squares

What do we get if, in Fig. 7, we minimize the sum of the square
deviations not vertically but from the southwest to the northeast?
Let P(y,c) in Fig. 8 stand for any point of the sample; PZ is
parallel to the 45° line.

Fig. 8. Directional least squares.

Let θ be the angle of inclination of the true consumption function;
that is, let tan θ = β be the slope of the line c = α + βy. Then in
triangle PZM, from the law of sines, we have

p/sin (90° + θ) = u/sin φ

where φ = 45° - θ, from which it follows that

p = √2 u_t/(1 - β) = √2 (c_t - α - βy_t)/(1 - β)

Then,

Σp_t² = [2/(1 - β)²] Σ(c_t - α - βy_t)²

Setting c = y - z and measuring all variables from their means,

Σp_t² = 2 Σ[y_t - z_t/(1 - β)]²

Minimizing Σp_t² with respect to 1/(1 - β), we obtain

1/(1 - β̃) = m_yz/m_zz

that is to say, the same expression that we found for γ̂_2.
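As a check on the digression, the sketch below (an invented numerical illustration in the same spirit, with α = 5 and β = 0.5 as in the running example) minimizes Σp_t² by brute-force grid search over β and compares the result with the closed-form answer 1/(1 − β̃) = m_yz/m_zz.

```python
import random

random.seed(1)
N = 4000
z = [random.uniform(0.0, 4.0) for _ in range(N)]
u = [random.gauss(0.0, 2.0) for _ in range(N)]
y = [(5.0 + zi + ui) / 0.5 for zi, ui in zip(z, u)]   # true alpha = 5, beta = 0.5
c = [yi - zi for yi, zi in zip(y, z)]

def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

yc, cc, zc = center(y), center(c), center(z)

def ssq_diagonal(beta):
    # squared deviations measured along the 45-degree direction:
    # p = sqrt(2) * (c - beta*y) / (1 - beta), variables from their means
    return sum(2.0 * ((ci - beta * yi) / (1.0 - beta)) ** 2
               for ci, yi in zip(cc, yc))

# crude grid search for the minimizing beta
grid = [i / 1000.0 for i in range(100, 900, 2)]
beta_tilde = min(grid, key=ssq_diagonal)

m_yz = sum(a * b for a, b in zip(yc, zc)) / N
m_zz = sum(a * a for a in zc) / N
beta_closed = 1.0 - m_zz / m_yz    # from 1/(1 - beta) = m_yz / m_zz

print(round(beta_tilde, 3), round(beta_closed, 3))
```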
4.5. Generalization of the results
Section 4.3 showed the pitfalls of ignoring the income identity in
estimating the consumption function, and Sec. 4.4 showed how to get
around this difficulty by the technique of simultaneous estimation,
which takes into account the entire model even though the investigator
may be interested in only a part. Chapters 5 to 9 deal with the
intricacies of simultaneous estimation and various approximations
thereof.
To prepare the way, let us enlarge the model slightly, by making
investment respond to income. The new model is

c_t = α + βy_t + u_t   (4-2)
i_t = z_t + γ + δy_t + v_t   (4-7)
c_t + i_t = y_t   (4-8)

where z_t is autonomous investment; i_t is total investment; and u_t, v_t are
random disturbances independent of each other and of present and
past values of z. The last sentence is a statement of Simplifying
Assumption 7, which will be explained and justified in the next chapter.

Letting s_t = y_t - c_t stand for saving, we obtain from (4-2) the
saving function

s_t = -α + (1 - β)y_t - u_t   (4-9)

Figure 9 shows saving SS and investment II as functions of income,
with zero disturbances (thick lines) and disturbed by ±u, ±v, respectively
(thin lines), and with the usual (stable) relative slopes.
Naive least squares applied to Fig. 9 underestimates the slope 1 - β
of SS (as it underestimated the slope of OQ in Fig. 7) and, hence, again
overestimates the marginal propensity to consume.
The Haavelmo bias again.
4.6. Bias in the secular consumption function
We have shown that naive curve fitting overestimates the slope of
the consumption function, even with large samples and whether or not
investment is a function of income. Statistical fits of the secular
consumption function give a slope varying from over 0.95 to nearly
1.0, contradicting the lower figures given by budget studies, introspection,
and Keynes's hunch. To reconcile these facts, consumption
theories of imitation, irreversible behavior, and more and more
explanatory variables have been invoked. A large part of what these
ingenious theories account for can be explained by Haavelmo's
proposition.
Further readings
Trygve Haavelmo's proposition was, apparently, stated first in "The
Statistical Implications of a System of Simultaneous Equations" (Econometrica,
vol. 11, no. 1, pp. 1-12, January, 1943), but a later article of his
applying the proposition to the consumption function has attracted far more
attention. This has appeared in three places: Trygve Haavelmo, "Methods
of Measuring the Marginal Propensity to Consume" (Journal of the American
Statistical Association, vol. 42, no. 237, pp. 105-122, March, 1947); reprinted
as Cowles Commission Paper 22, new series; and again as chap. 4 of Hood,
pp. 75-91. Haavelmo gives numerical results and confidence intervals for
the parameter estimates.
Jean Bronfenbrenner, "Sources and Size of Least-squares Bias in a Two-equation
Model," chap. 9 of Hood, pp. 221-235, extends Haavelmo's proposition
to three more special cases. An early article by Lawrence R. Klein,
"A Postmortem on Transition Predictions of National Product" (Journal of
Political Economy, vol. 54, no. 4, pp. 289-308, August, 1946), puts the
Haavelmo proposition in proper perspective, as indicating only one of the
many sources of malestimation.

Milton Friedman, A Theory of the Consumption Function (New York:
National Bureau of Economic Research, 1957), also compares and discusses
rival measurements of consumption, but his main concern is to test the
Permanent Income hypothesis and to refine the consumption functions, not to
discuss econometric pitfalls. It contains valuable references to the literature
of the consumption function.
According to Guy H. Orcutt, "Measurement of Price Elasticities in International
Trade" (Review of Economics and Statistics, vol. 32, no. 2, pp. 117-132,
May, 1950), Haavelmo's proposition explains why exchange devaluation had
been underrated as a cure to balance-of-payments difficulties. Orcutt
confines mathematics to appendixes and gives many further references.

Tjalling C. Koopmans in "Statistical Estimation of Simultaneous Economic
Relations" (Journal of the American Statistical Association, vol. 40, no. 232,
pt. 1, pp. 448-466, December, 1945), discusses the Haavelmo proposition with
the help of a supply-and-demand example and with interesting historical
comments. When the random disturbances are viewed not as errors of
observation clinging to specific variables but as errors of the econometric
relationship itself, then they affect all simultaneous endogenous variables
symmetrically, and Haavelmo's problem rears its head. The Koopmans
article is a good preview of the next chapter.
CHAPTER 5
Many-equation linear models
5.1. Outline of the chapter
The moral of Chap. 4 is this: if a model has two equations, they
cannot be estimated one at a time, each without regard for the other,
because both take part together in generating the phenomena from
which we draw samples. This fact rules out, except in special cases,
the use of the pedestrian technique of naive least squares. Both the
moral and the reasons behind it remain in force as the number of
equations in the model increases.

The present chapter is rather unimportant and might be skipped or
skimmed at first. All its principles are implicit in Chap. 4.
The main task of Chap. 5 is to systematize the study of many-equation
linear models. First we present some standard and effort-saving
notation (Sec. 5.2). Next, we review the Simplifying Assumptions,
which were originally introduced for one-equation models in
Chap. 1, to see precisely how they extend to the general case (Sec. 5.3).
With two or more equations, a seventh Simplifying Assumption is
required, that of stochastic independence among the equations (Sec. 5.4).
The presence of several simultaneous equations in a model complicates
the likelihood function with the term det J, which we have
ignored until now; in intricate fashion det J involves the parameters of
all equations in the system. (The last proposition merely restates the
moral of Chap. 4.) The digression on Jacobians explains what det J
is doing in the likelihood function.

If we heed the moral to the letter and take det J into account, we
get into awfully long computations (see Sec. 5.5) in spite of all our
original Simplifying Assumptions.
Whether computations are long or short, it pays to lay them out in
an orderly way. This is a general precept, of course, but its value
stands out most dramatically in the present chapter. It pays not only
to do computations in an orderly manner but also to perform some
redundant ones just in case you might want to check some alternative.
Econometricians normally settle down to a specific model only
after much experimentation. And, further, redundant computations
become necessary when we want to estimate a given promising model
by increasingly refined techniques. The wisdom of performing the
redundant computations will become fully apparent only after we have
dealt with ovcridentified systems, instrumental variables, limited
information, and Theirs method (Chaps. 6 to 9).
5.2. Effort-saving notation
It pays to establish once and for all a uniform notation for complete
linear models of several equations. These are conventions, not
assumptions.
The endogenous variables are denoted by y's. There are G endogenous
variables, called y_1, y_2, . . . , y_G and, collectively, y. y is the
vector (y_1, y_2, . . . , y_G). We use g (g = 1, 2, . . . , G) as running
subscript for endogenous variables.

The exogenous variables are denoted by z's. There are H exogenous
variables, called z_1, z_2, . . . , z_H, and z is their vector. These
may be lagged values of the y's only by special mention. The running
subscript of an exogenous variable is h = 1, 2, . . . , H.

All definitions have been solved out of the system, so that there are
exactly G equations, all stochastic, with errors u_1, u_2, . . . , u_G.
u = (u_1, u_2, . . . , u_G). We speak of the gth equation.

The coefficients of the y's are called βs, and those of the z's are called
γs. They bear two subscripts: the first refers to the equation, the
second to the variable to which the parameter corresponds.

We get rid of the constant term (if any) by letting the last exogenous
variable z_H be identically equal to 1; its parameter γ_gH then becomes
the constant term. In most applications we shall not bother to write
the constant term at all. Either it is in the last term γ_gH z_H with
z_H = 1, or it has been eliminated by measuring all variables from their means.
B and Γ represent the matrices of coefficients in their natural order:

B = [β_11  β_12  · · ·  β_1G]      Γ = [γ_11  γ_12  · · ·  γ_1H]
    [β_21  β_22  · · ·  β_2G]          [γ_21  γ_22  · · ·  γ_2H]
    [ ·     ·            ·  ]          [ ·     ·            ·  ]
    [β_G1  β_G2  · · ·  β_GG]          [γ_G1  γ_G2  · · ·  γ_GH]

B is always square and of size G × G; Γ is of size G × H. A stands for
the elements of B and Γ set side by side:

A = [B Γ]

that is to say, for the matrix of all coefficients in the model, whether
they belong to endogenous or exogenous variables. A is of size
G × (G + H).
x stands for the elements of y and z set side by side,

x = (y_1, y_2, . . . , y_G; z_1, z_2, . . . , z_H)

that is to say, x is the vector of all variables, whether endogenous or
exogenous, but in their natural order.

α_1 stands for the first row of A, α_2 for the second row, etc.; similarly
for β_1, β_2, . . . , β_G, γ_1, γ_2, . . . , γ_G. That is, a lowercase bold Greek
letter with a single subscript g represents (some of) the parameters of a
single equation (the gth) of the system.

We reduce the number of parameters to be estimated by dividing
each equation by one of its coefficients. This does not affect the
model in any other way. We use the gth coefficient of the gth equation
for this, so that β_gg = 1. Henceforth we shall always take matrix B
in its "standardized form"
B = [1     β_12  · · ·  β_1G]
    [β_21  1     · · ·  β_2G]
    [ ·     ·            ·  ]
    [β_G1  β_G2  · · ·  1   ]
A model can be written in a variety of forms:

1. Explicitly, as below (time subscripts omitted):

y_1 + β_12 y_2 + · · · + β_1G y_G + γ_11 z_1 + γ_12 z_2 + · · · + γ_1H z_H = u_1
β_21 y_1 + y_2 + · · · + β_2G y_G + γ_21 z_1 + γ_22 z_2 + · · · + γ_2H z_H = u_2
· · · · · · · · · · · · · · · · · · · ·
β_G1 y_1 + β_G2 y_2 + · · · + y_G + γ_G1 z_1 + γ_G2 z_2 + · · · + γ_GH z_H = u_G   (5-1)

2. In extended vector form:

β_1 y + γ_1 z = u_1
β_2 y + γ_2 z = u_2
· · · · ·
β_G y + γ_G z = u_G   (5-2)

3. In condensed vector form:

α_1 x = u_1
α_2 x = u_2
· · ·
α_G x = u_G   (5-3)

4. In extended matrix form:

By + Γz = u   (5-4)

5. In condensed matrix form:

Ax = u   (5-5)
Note that, when the context is clear, bold lowercase letters stand
either for a row or for a column vector.

Finally, σ_gh(t) stands for the covariance of u_g(t) with u_h(t); [σ_gh(t)] is
the matrix of these covariances; and [σ^gh(t)] is its inverse.
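As a concrete illustration of this notation (all numbers invented), the following sketch builds B, Γ, and A = [B Γ] for a model with G = 2 and H = 2, and verifies that the condensed form Ax = u of (5-5) reproduces the extended form By + Γz = u of (5-4).

```python
# Two endogenous variables (G = 2), two exogenous (H = 2, with z_2 = 1
# carrying the constant term).  All coefficient values are made up.
B = [[1.0, -0.8],           # standardized form: beta_gg = 1
     [-0.3, 1.0]]
Gamma = [[-0.5, -2.0],
         [0.7, -1.0]]

# A = [B Gamma], the G x (G + H) matrix of all coefficients
A = [brow + grow for brow, grow in zip(B, Gamma)]

y = [4.0, 2.5]              # endogenous values at some observation
z = [1.2, 1.0]              # exogenous values (z_2 = 1)
x = y + z                   # x = (y_1, y_2; z_1, z_2), in natural order

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

# condensed form Ax = u agrees with extended form By + Gamma z = u
u_condensed = matvec(A, x)
u_extended = [bg + gg for bg, gg in zip(matvec(B, y), matvec(Gamma, z))]
print(u_condensed, u_extended)
```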
5.3. The Six Simplifying Assumptions generalized
A laconic mathematician can generalize the Six Simplifying Assumptions
with a stroke of brevity by saying that they continue to apply if
we replace the symbol u by u. Our task is to interpret this in terms of
economics.

In Chap. 1, I discussed the Six Simplifying Properties when there
was a single equation in the model and, therefore, a single disturbance.
Now we have one disturbance for each equation, and u is the vector
made up of them, u(t) = (u_1(t), u_2(t), . . . , u_G(t)).
Assumption 1

"u is a random variable" means that each u_g(t) is a random variable,
that is to say, that all equations remaining after solving out the
definitions are stochastic.

Assumption 2

"u has expected value 0" means that the mean of the joint distribution
is the vector 0 = (0,0, . . . ,0), or that each u_g has zero expected
value.

Assumption 3

"u has constant variance" means that the covariances

σ_gh = cov (u_g, u_h)

of the several disturbances do not vary with time.

Assumption 4

"u is normal" means that u_1(t), u_2(t), . . . , u_G(t) are jointly
normally distributed.

Assumption 5

"u is not autocorrelated" means that there is no correlation between
the disturbance of one equation and previous values of itself.

Assumption 6

"u is not correlated with z" means that no exogenous variable, in
whichever equation it appears, is correlated with any disturbance,
past, present, or future, of any equation in the model.
On these assumptions, the likelihood function of the sample is

L = (2π)^(-GS/2) (det J)^S (det [σ_gh])^(-S/2) exp{-½ Σ_{s=1}^{S} u(s)[σ^gh]u(s)}   (5-6)

which should be compared with (2-2). The analogy is perfect. The
expression in the curly braces can also be written

-½ Σ_s Σ_g Σ_h u_g(s) σ^gh u_h(s)   (5-7)

Another way to write the likelihood function is

L = (2π)^(-GS/2) (det J)^S (det [σ_gh])^(-S/2) exp{-½ Σ_s Ax(s)[σ^gh]x(s)A}   (5-8)

which brings out the fact that L is a function (1) of all the unknown
parameters β_gh, γ_gh, σ_gh and (2) of all the observations x(s) (s = 1, 2,
. . . , S). The function's logarithmic form

S^(-1) log L = -(G/2) log 2π + log det J - ½ log det [σ_gh]
             - (2S)^(-1) Σ_s Ax(s)[σ^gh]x(s)A   (5-9)

is easier to use.
5.4. Stochastic independence

The seventh Simplifying Assumption:

[σ_gh] is a diagonal matrix, or cov (u_g, u_h) = 0 for g ≠ h

is not obligatory, but it is easy to rationalize. It states that the
disturbance of one equation is not correlated with the disturbance in any
other equation of the model in the same time period, something quite
different from Assumption 6.
Recall that each random term is a gathering of errors of measurement,
errors of aggregation, omitted variables, omitted equations, and
errors of linear approximation. Assumption 7 states that either
(1) the gth equation and the hth equation are disturbed by different
random causes, or (2) if they are disturbed by the same causes, different
"drawings" go into u_g(t) and u_h(t). This assumption is clearly
inapplicable in the following situations:

1. In year t, all or nearly all statistics were subject to larger than the
usual errors, because of a cut in the budget of the Statistics Bureau.

2. Errors of aggregation affect mainly national income (because of
shifts in distribution), and national income enters several equations of
the model.

3. Omitted variables (one or more) are known to affect two (or more)
equations. For instance, weather affects the supply of watermelons,
cotton, and whale blubber. Now if the model contains equations for
watermelons and blubber, the inclusion of weather in the random term
does not hurt, because relatively independent drawings of weather
(one in the Southeast, one in the South Pacific) affect these two
industries. However, if watermelons and cotton are included in the
model, both of these are grown in the same belt, the weather affecting
them is one and the same, and Assumption 7 is violated.

Assumption 7 simplifies the computations (1) because it leaves fewer
covariances to estimate, (2) because det [σ_gh] becomes a simple product
Π_g σ_gg, and (3) because all the cross terms¹ in (5-7) drop out. This can
reduce computations by a factor of 2 or 3 for a model of as few as three
equations and by a much greater factor for larger systems.
¹ Those for which g ≠ h.

Digression on Jacobians

The likelihood function involves a term, det J, the Jacobian
of the functions u, say, with respect to the variables y;
we have disregarded det J until now, since we have taken it on
faith to be equal to 1. This is no longer true in a many-equation
model. Here J is a matrix of unknown parameters, the same βs,
in fact, that we are trying to estimate with the likelihood function.

The main ideas behind J are three:

1. If you know the probability distribution of a variable u (or
several variables u_1, u_2, . . . , u_G), then you can find the probability
distribution of a variable y related to u functionally (or of
several y's related functionally to the u's).
2. If the u's and y's are equally numerous and if the functions
connecting the two sets are one-to-one, continuous, and with continuous
first derivatives, then the matrix J of all partial derivatives
of the form ∂u/∂y will have an inverse.

3. If conditions 1 and 2 are satisfied, then we can calculate the
joint probability distribution q of the y's from the known joint
probability distribution p of the u's (omitting the subscript t) as
follows:

p(u_1, u_2, . . . , u_G) du_1 du_2 · · · du_G = det J · p(u_1, u_2, . . . , u_G) dy_1 dy_2 · · · dy_G

or

q(y_1, y_2, . . . , y_G) dy_1 dy_2 · · · dy_G = det J · p(u_1, u_2, . . . , u_G) dy_1 dy_2 · · · dy_G   (5-10)
I shall illustrate these three ideas by examples.
Example 1

Let u be a single variable whose probability distribution we
know to be as follows:

Value of u:          -4     -3      1      3
Probability p(u):   0.1    0.2    0.4    0.3

Let y be related functionally to u as follows:

y(u) = u² - 4u + 3   (5-11)

As u takes on its four values, y takes on the corresponding
values y(-4) = 35, y(-3) = 24, y(1) = 0, y(3) = 0. Since we
know how often u is equal to -4, -3, 1, and 3, we can find how
often y is equal to 35, 24, and 0.

Value of y:          35     24      0
Probability q(y):   0.1    0.2    0.7
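Example 1 can be reproduced mechanically; the point of the sketch below (Python used purely as a calculator) is that the probabilities of the u-values mapping to the same y-value accumulate.

```python
# Distribution of u from Example 1 and the mapping y = u**2 - 4*u + 3
p_u = {-4: 0.1, -3: 0.2, 1: 0.4, 3: 0.3}

q_y = {}
for u, prob in p_u.items():
    y = u * u - 4 * u + 3
    q_y[y] = q_y.get(y, 0.0) + prob   # u = 1 and u = 3 both map to y = 0

print(q_y)
```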
Example 2

The same can be done with several y's and u's connected by an
appropriate set of functions, for instance,

y_1 = u_1 - 3u_1² - u_2
y_2 = e^(-u_1) + log u_2

provided the probability distribution p(u_1, u_2) is known.

Relation (5-11) is not one-to-one, since, for every value of y, u
can have two values. Accordingly, in Example 1 the second
condition is violated, and the Jacobian is undefined. The same is
true for Example 2.

Whenever the functional relation between the u's and y's is
one-to-one, [∂u/∂y] and [∂y/∂u] are single-valued and their
determinants multiply up to the number 1.
Example 3

y(u) = 3u + log u - 4

Though it is very hard to express u in terms of y, we know
that, since dy/du = 3 + 1/u = (3u + 1)/u, the Jacobian
J = du/dy = u/(3u + 1).
Example 4

y_1 = -u_1 + u_2
y_2 = e^(-u_1) + log u_2 + 5   (5-12)

Here we can compute det J from knowledge of

det [∂y/∂u] = det [ -1          1    ]
                  [ -e^(-u_1)  1/u_2 ]

since it follows that det J = u_2/(u_2 e^(-u_1) - 1). Therefore, by
(5-10), the probability distribution of the y's is

q(y_1, y_2) dy_1 dy_2 = [u_2/(u_2 e^(-u_1) - 1)] p(u_1, u_2) dy_1 dy_2
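Example 4's Jacobian can be checked numerically. The sketch below (the evaluation point u_1 = 0.2, u_2 = 3.0 is arbitrary) differentiates the two functions of (5-12) by central finite differences and verifies that det [∂y/∂u] and det J = u_2/(u_2 e^(-u_1) − 1) are reciprocals.

```python
import math

def f(u1, u2):
    # the two functions of Example 4
    return (-u1 + u2, math.exp(-u1) + math.log(u2) + 5.0)

u1, u2 = 0.2, 3.0
h = 1e-6

# numerical partial derivatives dy_i/du_j by central differences
dy_du1 = [(a - b) / (2 * h) for a, b in zip(f(u1 + h, u2), f(u1 - h, u2))]
dy_du2 = [(a - b) / (2 * h) for a, b in zip(f(u1, u2 + h), f(u1, u2 - h))]
det_dy_du = dy_du1[0] * dy_du2[1] - dy_du2[0] * dy_du1[1]

# det J = det[du/dy] should be the reciprocal: u2 / (u2*exp(-u1) - 1)
det_J = u2 / (u2 * math.exp(-u1) - 1.0)

print(det_dy_du * det_J)   # should come out very close to 1
```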
Now, what relevance does all this have to econometrics?
Very simple. Let y_1, y_2, . . . , y_G be endogenous variables, and
let u_1, u_2, . . . , u_G be the random errors attached to the structural
equations. The model's G equations are explicit functional
relations between the y's and the u's, like (5-12). Directly, we
know nothing at all about the probability of this or that combination
of y's. Nevertheless, (5-10) allows us to compute this probability,
namely, in terms of J and the probability distribution
of the u's. It turns out that the right-hand side of (5-10) involves
only the parameters we seek, the observations we can make, and
the probability distribution p, which we have already specified
when we constructed the model.

If the structural equations are all linear, as in (5-1), the
matrix J of all partial derivatives of the form ∂u/∂y turns out
to be nothing but the matrix B itself:

J = [1     β_12  · · ·  β_1G]  = B
    [β_21  1     · · ·  β_2G]
    [ ·     ·            ·  ]
    [β_G1  β_G2  · · ·  1   ]
5.5. Interdependence of the estimates

Now that we know that J = B, we can both find the values β, γ, σ
that maximize the likelihood function (5-9) and compute its actual
value. Actually we do not care how large L itself is.

Naturally, maximizing such a function by ordinary methods is a
staggering job; we won't undertake it. In fact, nobody undertakes it
by direct attack. We shall use (5-9) to answer the following question:
In order to estimate this particular parameter or this particular
equation, do we need to estimate all parameters? The answer,
generally, is yes.
Note first of all that the maximum likelihood method of estimating
B, Γ, and [σ_gh] differs from the naive least squares method quite radically,
because the least squares method does not involve the term log det B
at all. In other words, the least squares method, if applied to the
model one equation at a time, omits from account the matrix B; it
does not allow the parameters of one equation to influence the estimation
of the parameters of another; nor does it allow the covariances σ_gh
to influence in the least the parameter estimates of any equation that
is being fitted.

Finally, the least squares technique estimates the covariances σ_gg
one at a time without involving any other covariance. Contrariwise,
in maximum likelihood, the estimates β̂ of one equation affect the β̂s
and γ̂s of another; the γ̂s of one equation affect the β̂s and γ̂s of another;
and one σ̂ affects another.
In a word, the sophisticated maximum likelihood method is very
expensive from the point of view of computations and is probably
more refined than the quality of the raw statistical data warrants.
Econometric theory is like an exquisitely balanced French recipe,
spelling out precisely with how many turns to mix the sauce, how
many carats of spice to add, and for how many milliseconds to bake
the mixture at exactly 474 degrees of temperature. But when the
statistical cook turns to raw materials, he finds that hearts of cactus
fruit are unavailable, so he substitutes chunks of cantaloupe; where the
recipe calls for vermicelli he uses shredded wheat; and he substitutes
green garment dye for curry, pingpong balls for turtle's eggs, and, for
Chalifougnac vintage 1883, a can of turpentine.
Two courses of action are open to the econometrician who is reluctant
to lavish refined computations on crude data:
1. Use the refined maximum likelihood method, but reduce the
burden of computation by making additional Simplifying Assumptions.
2. Water down the maximum likelihood method to something more
pedestrian but not quite so naive as least squares. Limited information,
instrumental variables, and other techniques are available; they
are the subject of Chaps. 7, 8, and 9.
5.6. Recursive models
If B is a triangular matrix,¹ the model is called recursive, and its
computation is lightened, because there are fewer βs to estimate and
because det B = 1.

The economic interpretation of a recursive model is the following.
There is an economic variable in the system (say, the price of coffee
beans) that is affected only by exogenous variables (like Brazilian
weather); next, there is a second economic variable (say, the price of a

¹ B is triangular if β_gh = 0 for all g < h.
cup of coffee) that is affected by exogenous variables (tax on coffee
beans) and by the one endogenous variable (price of coffee beans) just
mentioned. Next, there is a third economic variable (say, the number
of hours spent by employees for coffee breaks) that depends only on
exogenous variables (the amount of incoming gossip) and (one or both
of) the first two endogenous variables but no others; and so on.
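The coffee-bean story can be put into numbers. In the sketch below all coefficients and right-hand-side values are invented; the point is only that a triangular B with ones on the diagonal has det B = 1, and that the system solves equation by equation, each y_g requiring only exogenous terms and previously computed y's.

```python
# y1 = price of coffee beans, y2 = price of a cup, y3 = coffee-break hours.
# B is triangular in the sense of the footnote: beta_gh = 0 for all g < h.
B = [[1.0, 0.0, 0.0],
     [-0.6, 1.0, 0.0],
     [-0.1, -0.4, 1.0]]

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

# Made-up values of u_g minus the exogenous part (gamma_g . z) of each equation,
# so that the system reads By = rhs.
rhs = [2.0, -0.5, 1.3]

# Forward substitution: equation g reads y_g + sum_{h<g} B[g][h]*y_h = rhs[g],
# so each y_g needs only the right-hand side and previously computed y's.
y = []
for g in range(3):
    y.append(rhs[g] - sum(B[g][h] * y[h] for h in range(g)))

print(det3(B), y)
```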
Exercises
5.A In the recursive system

y_1 = γz_1 + u
y_2 = βy_1 + γ_1 z_1 + γ_2 z_2 + v

let the Simplifying Properties hold for u, v with respect to the exogenous
variables. Prove that, if β is estimated by naive least squares,
that is, if

β̌ = m_(y_2·z_1 z_2)(y_1·z_1 z_2) / m_(y_1·z_1 z_2)(y_1·z_1 z_2)

then β̌ is biased.
5.B In the recursive model
xₜ = βyₜ + uₜ
yₜ = γxₜ₋₁ + vₜ
show that β̂ and γ̂ are unbiased but that least squares applied to the autoregressive equation obtained as a combination of the two equations gives biased estimates.
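A Monte Carlo sketch of the bias claimed in Exercise 5.B; the parameter values, sample size, and number of replications are illustrative assumptions. Substituting one equation into the other gives xₜ = βγxₜ₋₁ + βvₜ + uₜ, and least squares on this autoregressive equation is biased in small samples.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, gamma, T, reps = 1.0, 0.6, 20, 4000   # made-up illustrative values

est = []
for _ in range(reps):
    x = np.zeros(T + 1)
    for t in range(1, T + 1):
        y_t = gamma * x[t - 1] + rng.normal()   # y_t = gamma*x_{t-1} + v_t
        x[t] = beta * y_t + rng.normal()        # x_t = beta*y_t + u_t
    # Least squares of x_t on x_{t-1} estimates the product beta*gamma.
    est.append(np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1]))

# Average estimate falls short of the true beta*gamma in small samples.
bias = np.mean(est) - beta * gamma
```

The downward small-sample bias of least squares on a lagged dependent variable is exactly what the exercise asks the reader to establish analytically.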
Further readings
The notation of Sec. 5.2 is worth learning because it is becoming standard
among econometricians. It is expanded in Koopmans, chap. 2.
Jacobians are illustrated by Klein, pp. 32–38. The mathematics of Jacobians, with proofs, can be found in Richard Courant, Differential and Integral Calculus, vol. 2, chap. 3 (New York: 1953), or in Wilfred Kaplan, Advanced Calculus, pp. 00–100 (Reading, Massachusetts: 1952).
Klein, p. 81, gives a simple example of a recursive model.
CHAPTER 6
Identification
6.1. Introduction
Identification problems spring up almost everywhere in econometrics
as soon as one departs from singleequation models. This chapter far
from exhausts the subject. In particular, the next two topics,
instrumental variables in Chap. 7 and limited information in Chap. 8,
are intimately bound up with it. The identification problem will arise
sporadically in later chapters.
Though this chapter is self-contained, some familiarity with the subject is desirable. I know of no better elementary treatment than that of Tjalling C. Koopmans, "Identification Problems in Economic Model Construction," chap. 2 in Hood. I have chosen to devote this chapter to a few topics which, in my opinion, either have not received convincing treatment or have not been put in pedagogic form.
The main results of this chapter are the following:
1. There are several definitions of identifiability. I show their
equivalence.
2. Lack or presence of identification may be due (a) to the model's a priori specification, (b) to the actual values of its unknown parameters, or (c) to the particular sample we happen to have drawn.
3. There are ways to detect overidentification and underidentification. These ways are not always foolproof. There are several ways to remove over- or underidentification.
4. In spite of the superficial fact that they are defined in analogous terms, underidentification and overidentification are qualitatively different properties: the former is nonstochastic, the latter stochastic; the former can be removed (in special cases) by means of additional restrictions, the latter is handled by better observations or longer computation.
6.2. Completeness and nonsingularity
The following discussion applies to all kinds of models, linear or not, large or small, but it will be illustrated by this example:
y₁ + γ₁₁z₁ + γ₁₂z₂ + γ₁₃z₃ + γ₁₄z₄ = u₁
β₂₁y₁ + y₂ + β₂₃y₃ + γ₂₁z₁ + γ₂₂z₂ + γ₂₃z₃ = u₂   (6-1)
β₃₁y₁ + y₃ + γ₃₁z₁ + γ₃₂z₂ = u₃
This model describes an economic mechanism that works somewhat
like this:
1. The parameters β and γ are fixed constants.
2. In each time period, someone supplies outside information about the exogenous variables z.
3. In each time period, someone goes to a preassigned table of random numbers, and, using a prescribed procedure, reads off some numbers u₁, u₂, u₃.
4. All this is fed into (6-1).
5. Values for the endogenous variables, y₁, y₂, y₃, are generated in accordance with the resulting system.
The last step succeeds if and only if the linear equations resulting from step 4 are independent. Otherwise there is an infinity of compatible triplets (y₁,y₂,y₃). The model is complete if it can be solved uniquely for (y₁,y₂,y₃); otherwise it is incomplete. To generate a unique triplet it is necessary and sufficient that the matrix B be nonsingular, meaning that no row of it is a linear combination of other rows.
The economic interpretation of singularity and nonsingularity is very simple. Each equation in (6-1) represents the behavior of a sector of the economy, say, producers, consumers, bankers, buyers, sellers, or middlemen. These sectors respond to exogenous stimuli z and economic stimuli y. They may respond to exogenous stimuli in any way whatsoever. In particular, it is quite permissible for them to respond in the same way to all exogenous stimuli (γ₁₁ = γ₂₁ = γ₃₁, γ₁₂ = γ₂₂ = γ₃₂, etc.). But, if the matrix B is to be nonsingular, they should respond in different ways to the endogenous stimuli. No sector may have the same parameters as another; no sector's responses may be the average of two other sectors' responses. No sector may be a weighted average of any other sectors, as far as economic stimuli are concerned.
To illustrate singularity, consider a simple economy which consists of three families responding to three economic stimuli but such that the third family makes an average response. Then B is singular, and the model containing the three families is incomplete. For nonsingularity the sectors must be sufficiently unlike each other. In fact this is the definition of sectors: that they are economically different from one another.
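The three-family illustration can be checked in two lines; the response coefficients below are made-up numbers, not from the text.

```python
import numpy as np

# Three "families" responding to three endogenous stimuli.
B = np.array([[1.0, 0.3, 0.2],
              [0.4, 1.0, 0.6],
              [0.7, 0.65, 0.4]])
B[2] = (B[0] + B[1]) / 2    # third family's response = average of the first two

# B is singular, so the model is incomplete: given z and u there is
# no unique triplet (y1, y2, y3).
assert abs(np.linalg.det(B)) < 1e-12
```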
Exercise
6.A Prove the following theorems by using the common sense of the five steps of the above discussion: "If Assumption 7 is made, then B is nonsingular," and "If B is singular, Assumption 7 cannot hold." These two statements can be reworded: "An econometric model is complete if and only if its sectors are stochastically independent." Appendix E proves this mathematically, but what is wanted in this exercise is an "economic" proof.
6.3. The reduced form
Every complete linear model By + Γz = u can be reduced to y = Πz + v. These two expressions are called the original form and the reduced form. If it is complete, the original model (6-1) can be reduced to
y₁ = π₁₁z₁ + π₁₂z₂ + π₁₃z₃ + π₁₄z₄ + v₁
y₂ = π₂₁z₁ + π₂₂z₂ + π₂₃z₃ + π₂₄z₄ + v₂   (6-2)
y₃ = π₃₁z₁ + π₃₂z₂ + π₃₃z₃ + π₃₄z₄ + v₃
Some obvious properties of (6-2) are worth pointing out: Its random disturbances v₁, v₂, v₃ are linear combinations of the original random disturbances and share their properties. However, the v's have different covariances from the u's. In particular, the v's are interdependent even if the u's were stochastically independent. (We seldom have to worry about the precise relation among the u's and v's.) Unlike the typical original form, each equation of the reduced form contains all the exogenous variables of the model.
Each equation of the reduced form constitutes a model that satisfies the Six Simplifying Assumptions of Chap. 1 and, therefore, may validly be estimated by least squares; these estimates are called π̂s. If it is possible to work back from the π̂s to estimate unambiguously the coefficients β, γ of the original form, we shall call such estimates β̂, γ̂ and say that (6-1) is exactly identified. Finally, the coefficients of the two forms (6-1), (6-2) are connected as follows:
−γ₁₁ = π₁₁   −γ₂₁ = β₂₁π₁₁ + π₂₁ + β₂₃π₃₁   −γ₃₁ = β₃₁π₁₁ + π₃₁
−γ₁₂ = π₁₂   −γ₂₂ = β₂₁π₁₂ + π₂₂ + β₂₃π₃₂   −γ₃₂ = β₃₁π₁₂ + π₃₂
−γ₁₃ = π₁₃   −γ₂₃ = β₂₁π₁₃ + π₂₃ + β₂₃π₃₃    0 = β₃₁π₁₃ + π₃₃
−γ₁₄ = π₁₄   −γ₂₄ = β₂₁π₁₄ + π₂₄ + β₂₃π₃₄    0 = β₃₁π₁₄ + π₃₄
(6-3)
It is possible, but messy, to solve for π₁₁, …, π₃₄ in terms of the βs and γs. The important fact is that, in general, all πs are a priori nonzero in the reduced form, even if many of the βs and γs are a priori zero in the original form.
Relations (6-3) can be written much more compactly:
−Γ = BΠ   (6-4)
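The compact relation between the two forms is easy to verify numerically. The B and Γ below follow the zero pattern of the three-equation example, with made-up values standing in for the nonzero parameters.

```python
import numpy as np

# Zero pattern of the three-equation example; nonzero entries are
# arbitrary illustrations, not the book's parameters.
B = np.array([[1.0,  0.0,  0.0],
              [0.5,  1.0, -0.4],
              [-0.3, 0.0,  1.0]])
Gamma = np.array([[0.8, -1.2,  0.6, 0.9],
                  [0.2,  0.7, -0.5, 0.0],
                  [-0.6, 0.4,  0.0, 0.0]])

# Reduced form of B y + Gamma z = u is y = Pi z + v with Pi = -inv(B) Gamma.
Pi = -np.linalg.solve(B, Gamma)

# The compact relation -Gamma = B Pi holds by construction.
assert np.allclose(-Gamma, B @ Pi)
# And every pi is nonzero even though many betas and gammas are zero.
assert np.all(np.abs(Pi) > 1e-12)
```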
6.4. Over- and underdeterminacy
As a preview for the rest of this chapter, imagine that (6-1) is complete. If so, its reduced form (6-2) exists and can be estimated by least squares. Let the estimates be π̂₁₁, …, π̂₃₄.
Now consider the leftmost column of equations in (6-3). Evidently the γs can be computed right away from the π̂s, uniquely and unambiguously. We say, then, that γ₁₁, γ₁₂, γ₁₃, γ₁₄ are exactly identified.
Consider next the last two equations of (6-3); they give rise to two estimates of β₃₁, namely, −π̂₃₃/π̂₁₃ and −π̂₃₄/π̂₁₄, which in general are quite different, no matter how ideal the sample. When this happens to parameters, we say that they are (or the equation that contains them is) overidentified; accordingly, system (6-3) overdetermines β₃₁.
Consider now the middle column of (6-3). Its four equations underdetermine the five unknowns β₂₁, β₂₃, γ₂₁, γ₂₂, γ₂₃. An equation to which such parameters belong is underidentified.
Obviously, then, the identification problem has something to do with the number of equations and unknowns in the system −Γ = BΠ. The Counting Rules of Sec. 6.7 will show this more precisely.
6.5. Bogus structural equations
Consider the supply-demand model
SS (Supply)   y₁ + β₁₂y₂ = u₁
DD (Demand)   β₂₁y₁ + y₂ = u₂   (6-5)
where y₁ represents price and y₂ represents quantity; linear combinations of the true supply and demand are called bogus relations and are branded with the superscript ©. A bogus relation may parade either as supply or as demand,
SS© = j(SS) + k(DD)   DD© = m(SS) + n(DD)
where j, k, m, n are unknown numbers, but suitable to make the standardized coefficients β₁₁©, β₂₂© of the bogus relations equal to 1. The bogus coefficients are connected with the true coefficients as follows:
β₁₁© = j + kβ₂₁ = 1   β₁₂© = jβ₁₂ + k
β₂₁© = m + nβ₂₁   β₂₂© = mβ₁₂ + n = 1
The bogus supply contains a random term
u₁© = ju₁ + ku₂
and the bogus demand contains an analogous term
u₂© = mu₁ + nu₂
Later on we shall use the following relations between the covariances of the bogus and the true disturbances:
var u₁© = j² var u₁ + 2jk cov (u₁,u₂) + k² var u₂
var u₂© = m² var u₁ + 2mn cov (u₁,u₂) + n² var u₂   (6-6)
cov (u₁©,u₂©) = jm var u₁ + (jn + mk) cov (u₁,u₂) + kn var u₂
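These relations are the usual rule for the covariance matrix of a linear transform of the disturbances, and can be spot-checked numerically; the values of j, k, m, n and of the true moments below are arbitrary illustrations.

```python
import numpy as np

j, k, m, n = 0.7, 0.3, -0.2, 1.1          # illustrative mixing weights
var1, var2, cov12 = 3.0, 10.0, 0.0        # illustrative true moments

# The three relations written out term by term:
var1_b = j**2 * var1 + 2*j*k * cov12 + k**2 * var2
var2_b = m**2 * var1 + 2*m*n * cov12 + n**2 * var2
cov_b = j*m * var1 + (j*n + m*k) * cov12 + k*n * var2

# Cross-check: covariance matrix of (u1©, u2©) = J (u1, u2) is J S J'.
S = np.array([[var1, cov12], [cov12, var2]])
J = np.array([[j, k], [m, n]])
Sb = J @ S @ J.T
assert np.allclose([var1_b, var2_b, cov_b], [Sb[0, 0], Sb[1, 1], Sb[0, 1]])
```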
6.6. Three definitions of exact identification
The discussion that follows is meant to apply to linear models only. Some results can be extended to other types of models (but not in this work).
A model or an equation in it may be either (exactly) identified, or underidentified, or overidentified. Setting aside for the moment the last two cases, here are three alternative definitions of exact identification: one in terms of the statistical appearance of the model, one in terms of maxima of the likelihood function L, and one in terms of the probability distribution of the endogenous variables.
Definition 1. A model is identified if its structural equations "look different" from the statistical point of view. An equation looks different if linear combinations of the other equations in the system cannot produce an equation involving exactly the same variables as the equation in question.
Thus the supply-demand model (6-5) is not exactly identified, because both equations contain the same variables, price and quantity. In the model
SS   y₁ + β₁₂y₂ + γ₁₁z₁ = u₁
DD   β₂₁y₁ + y₂ = u₂   (6-7)
where z₁ represents rainfall, a linear combination of SS and DD contains the same variables as SS itself. Not so for DD, because every nontrivial linear combination introduces rainfall into the demand equation. In this model the demand equation is identified, but the supply is not exactly identified. In such cases the model is not exactly identified.
Definition 2. A model is identified if the likelihood function L(A) has a unique maximum at a "point" A = A⁰. This means that, if you substitute the values A⁰ in L, L is maximal; at any other point L is definitely smaller. Similarly, an equation is exactly identified if the likelihood function L becomes smaller when you replace the set α⁰_g of that equation's parameters by any other set α_g. This way of looking at the matter is presented in detail later on, in Sec. 6.12.
Definition 3. Anything (a model, an equation, a parameter) is called exactly identified if it can be determined from knowledge of the conditional distribution of the endogenous variables, given the exogenous. This is to say, it is identified if, given a sample that was large enough and rich enough, you could determine the parameters in question. We know that, no matter how large the sample or how rich, we could never disentangle the two equations of (6-5).
All three definitions appear to say that exact identification is not a stochastic property, for it does not seem to depend on the samples we may chance to draw. We shall return to this question later on.
One must be very accurate and careful about the terminology. Over-, under-, and exact identification are exhaustive and mutually exclusive cases. Identified means "either exactly or overidentified." Not identified means "underidentified."
Underidentification occurs when:
By linear combinations of the equations one can obtain a bogus equation that looks statistically like some true equation (Definition 1).
The likelihood function has a maximum maximorum at two or more points of the parameter space (Definition 2).
Knowledge of the conditional distribution of the endogenous variables, given the exogenous, does not determine all the parameters of the model (Definition 3).
There are three principal ways to avert (or at least to detect)
absence of exact identification: (1) constraints on the a priori values of
the parameters; (2) constraints on the estimates of the parameters;
(3) constraints on the stochastic assumptions of the model.
6.7. A priori constraints on the parameters
Two new symbols will speed up the discussion considerably. Suppose we are discussing the third equation of a model. A single asterisk will denote the variables present in the third equation; a double asterisk, those absent from the third equation. Asterisks can be attached to variables, to their parameters, or to vectors of such variables and parameters. The asterisk notation has now become standard in econometric literature, and Appendix F gives a detailed account of it.
The commonest a priori restrictions on A are (1) zero restrictions, like γ₂₄ = 0; (2) parameter equalities in the same equation, for example, γ₂₁ = γ₂₂; (3) other equations involving parameters of several equations a priori.
These cases have economic counterparts, which I proceed to illustrate.
Zero restrictions
Zero restrictions are common and handy. A zero restriction says that, for all we know, such and such a variable is irrelevant to the behavior of a given sector. If nothing but zero restrictions are contemplated, then we have a handy counting rule (Counting Rule 1) for telling whether an equation is identified.
If an equation of a model contains all the variables of the model, it is underidentified, because linear combinations of all the equations look statistically just like it. To avoid this underidentification, the following two conditions are necessary:
1. That some variables (call them x**) be absent from this equation.
2. That the variables (call them x*) present in the equation in question, whenever they appear in another equation, be mixed with at least one x**.
In (6-1) the first equation is identified, because any intermixture of the second equation brings in variables y₂** and y₃** (double-starred from the point of view of the first equation), and intermixture of the third equation brings in y₃**, which is absent from the first equation. In (6-1) the second equation is underidentified, because the third equation can be merged into it without bringing in any variable that is not already in the pure, uncontaminated second equation. Finally, the third equation is identified, because an intermixture of the first equation introduces z₃** and z₄**, and intermixture of the second equation introduces y₂** and z₃** (the double stars are now from the point of view of the third equation).
This example shows that underidentification can be detected by checking whether given strategic parameters in the model are specified a priori to be zero or nonzero. This justifies the following statement:
Counting Rule 1. For an equation to be exactly identified it is necessary (but not sufficient) that the number of variables absent from it be one less than the number of sectors.
Thus, if G* and H* are, respectively, the numbers of endogenous and exogenous variables present in the gth equation, then for the gth equation to be identified it is necessary (but not sufficient) that (G + H) − (G* + H*) = G − 1, or that H − H* = G* − 1.
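The order condition can be applied mechanically from the zero pattern alone. This sketch runs it over the three-equation example (6-1); the variable names and the helper itself are illustrations, not the book's.

```python
# Variables present in each equation of the three-equation example.
equations = {
    1: {"y1", "z1", "z2", "z3", "z4"},
    2: {"y1", "y2", "y3", "z1", "z2", "z3"},
    3: {"y1", "y3", "z1", "z2"},
}
G = 3                                       # number of sectors (equations)
all_vars = set().union(*equations.values()) # G + H = 7 variables in all

# Number of variables absent from each equation.
absent = {g: len(all_vars) - len(present) for g, present in equations.items()}

for g, a in absent.items():
    if a < G - 1:
        verdict = "underidentified"
    elif a == G - 1:
        verdict = "order condition for exact identification met"
    else:
        verdict = "more exclusions than G - 1 (candidate for overidentification)"
    print(f"equation {g}: {a} variables absent -> {verdict}")
```

Run on this pattern, equation 1 meets the condition (2 absent), equation 2 fails it (1 absent), and equation 3 has more exclusions than needed (3 absent), matching the discussion above.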
Parameter equalities in the same equation
Another quite common a priori restriction is to set two or more parameters of a given equation a priori equal. For instance, let us interpret (6-1) as a model of bank behavior, where z₁ represents balances of banks at the Federal Reserve and z₂ represents balances of banks at other banks. It is conceivable that a commercial bank may conduct its loan policy by looking at its total balances and not at whether they are held at the Federal Reserve or at another bank. The restriction would be expressed γ₂₁ = γ₂₂. On the other hand, some other sector, say, foreign banks, may treat the two kinds of balances differently, γ₃₂ ≠ γ₃₁. Under these conditions, if the third equation is intermixed with the second equation, the result cannot masquerade as the second equation, because the bogus second equation would have different coefficients for z₁ and z₂, contrary to the a priori assumption that the response to all balances (Federal Reserve and other) is identical.
Linear equations connecting the parameters of different equations
Suppose that a model contains a production function and an equation showing the distribution of national income by factor shares. Then the coefficient of the share of labor is a priori equal to the labor coefficient of the production function, on the grounds of the marginal productivity theory of wages.
Collectively, all the linear restrictions on A discussed so far can be capsuled into Counting Rule 2. Let A** be what is left of A if we throw out the columns corresponding to the variables present in the gth equation.
Counting Rule 2. For the gth equation to be exactly identified it is necessary and sufficient that the matrix A** have rank G − 1.
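The rank condition can likewise be checked numerically. Below, A = [B Γ] carries the zero pattern of the three-equation example (6-1), with made-up values standing in for the unrestricted parameters, and A** keeps only the columns of the variables absent from the equation under test.

```python
import numpy as np

# A = [B  Gamma] with the zero pattern of the three-equation example;
# nonzero values are arbitrary stand-ins for the true parameters.
# Column order: y1, y2, y3, z1, z2, z3, z4.
A = np.array([
    [1.0,  0.0,  0.0,  0.8, -1.2,  0.6, 0.9],   # equation 1
    [0.5,  1.0, -0.4,  0.2,  0.7, -0.5, 0.0],   # equation 2
    [-0.3, 0.0,  1.0, -0.6,  0.4,  0.0, 0.0],   # equation 3
])
G = 3
present = {0: [0, 3, 4, 5, 6], 1: [0, 1, 2, 3, 4, 5], 2: [0, 2, 3, 4]}

ranks = []
for g in range(G):
    absent_cols = [c for c in range(A.shape[1]) if c not in present[g]]
    ranks.append(np.linalg.matrix_rank(A[:, absent_cols]))  # rank of A**
# Exact identification of equation g requires rank(A**) = G - 1 = 2.
```

For this pattern the ranks come out 2, 1, 2: equations 1 and 3 pass the rank condition while equation 2 fails, agreeing with the zero-restriction discussion above.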
These tests and counting rules can (and should) be applied before you start computations.
There are no convenient counting rules for nonlinear restrictions on the parameters.
Inequalities such as α > 0 or α < 0 do not help to remove underidentification. For instance, knowledge that demand is downward-sloping and that supply is upward-sloping does not help to identify the model (6-5).
6.8. Constraints on parameter estimates
Consider again the supply-demand model (6-7). It states a priori that rainfall influences supply and not demand; and this restriction identifies the demand equation (but not the supply). Now imagine that you draw an unlucky sample made up of cases where the other random elements u₁ have annihilated the theoretical effect of rainfall. You will get γ̂₁₁ (for this sample) equal to zero. The sample has behaved as if rainfall did not influence supply, i.e., as if the model were reduced to (6-5), where the demand was statistically indistinguishable from the supply.
The moral of this is: If you are not a priori certain that supply is influenced by rainfall (not only theoretically but also in the sample period) then do not proceed with the estimation of demand. If you fear that rainfall fails to affect supply (whether in the sample or generally), then to estimate the demand introduce in the supply function another variable z₂ (say, last year's price of a competing crop) that you are certain influences (if ever so little) this year's supply both theoretically and in the sample period. The new model then is
SS   y₁ + β₁₂y₂ + γ₁₁z₁ + γ₁₂z₂ = u₁   (6-8)
DD   β₂₁y₁ + y₂ = u₂
and so last year's price takes on the burden that rainfall is supposed to carry in making the demand identifiable.
A very neat extension of Counting Rule 2 covers all these requirements: For exact identification, the ranks of A** and Â** must equal G − 1.
We can show this in a third way.¹ If, in the original model (6-7),
¹ With acknowledgments to T. C. Koopmans, in Hood, pp. 31–32.
γ₁₁ is truly nonzero, then it is impossible to construct a bogus demand equation without detecting it. Take as the bogus demand
DD© = ⅓DD + ⅔SS
Then the bogus random term of demand is
u₂© = (2u₁ + u₂ − 2γ₁₁z₁)/3
Then cov (u₂©,z₁) is not zero, and will show up in estimation, unless the sample is the unlucky one in which rainfall is neutralized (γ̂₁₁ ≈ 0) by the random factor. If, upon completing the estimation, we discover that m_(z₁·û₂) is quite different from zero, then we can detect underidentification but we cannot remove it. On the other hand, the discovery that m_(z₁·û₂) is nearly zero is no guarantee that we have identified demand if there is a strong reason to suspect that supply is unaffected by rainfall.
6.9. Constraints on the stochastic assumptions
Let the random terms of (6-5) satisfy Simplifying Assumptions 1 to 7, so that cov (u₁,u₂) = 0. Will this help to identify the supply SS? Sometimes. Suppose that we knew beforehand that var u₁, cov (u₁,u₂), var u₂ were of the orders of magnitude 3, 0, 10, respectively. The "deception" can be detected from (6-6) if Σ(û₁©)², which is the estimate of var u₁©, is very different from 3. This can have happened by chance in the sample used, but it becomes more and more unlikely the more Σ(û₁©)² differs from 3. On the other hand, the bogus variances and covariances may have nothing peculiar about them; indeed they may equal 3, 0, and 10, respectively, because of a special set of values that j, k, m, and n have taken on. Therefore, in general, there is no guarantee that SS© will look statistically different from SS, even if we have complete knowledge of the underlying covariances of the random term.
Another way to impose identification on a model is to say something specific about the variances of the random terms. This was done by
Schultz in some early studies of agricultural markets.¹ In some of Schultz's work, both supply and demand are functions of the same two endogenous variables (price and quantity) and of random shocks. However, supply is more random than demand. Then the scatter of observed points will be more in agreement with the demand than with the supply function. Ambiguity is not eliminated entirely, but it is reduced as the randomness of supply increases relative to the randomness of demand. In the notation of (6-6), the restriction takes the form var u₁ = q var u₂; and identification improves with increase in q. More complex restrictions of this kind could also help.
To summarize the results of Secs. 6.7 to 6.9:
1. Identification can be checked before computing by use of the Counting Rules as applied to A.
2. If you fear that an equation is underidentified because you are not sure whether a given variable x reacts significantly, estimate the equation anyhow and then check whether the covariance of x with the residual û_g is near zero; if not, you may have identified the gth equation. If m_(x·û_g) is near zero, you have not identified your equation. If the numerically largest determinant of rank G − 1 from A** is close to zero, x probably did not play a significant role.
3. There are tests that help detect underidentification.
4. It is sometimes possible to remove underidentification.
6.10. Identifiable parameters in an underidentified equation
When an equation is underidentified, is it perhaps possible to identify one or more of its parameters, though not all? For instance, what about the identifiability of γ₁₁ in (6-7)? Intuition says that γ₁₁ cannot be adulterated by linear combinations of DD, since z₁ occurs only in the supply SS. Intuition is wrong if it concludes that this fact makes γ₁₁ identifiable. Applying (6-4), we have
−γ₁₁ = π₁₁ + β₁₂π₂₁
The πs can be computed from the reduced form
¹ Henry Schultz, The Theory and Measurement of Demand, pp. 72–81 (University of Chicago Press, Chicago: 1938).
y₁ = π₁₁z₁ + v₁
y₂ = π₂₁z₁ + v₂
but β₁₂ is and remains unknown and, therefore, so does γ₁₁.
So, contrary to intuition, the fact that a given variable enters one equation of a model and no others does not make its coefficient identifiable. Underidentification is a disease affecting all parameters of the affected equation. For, if the gth equation is unidentified, this means that there are fewer equations than unknowns in the gth row of formula (6-4). All coefficients of the gth equation enter (6-4) symmetrically, and so none can have a privileged position over the others.
Let us now ask whether we can identify, in an otherwise unidentifiable equation, the ratio of two unidentifiable coefficients. In special cases it may be both important and sufficient to know the relative rather than the absolute impact of two kinds of variables. Let us consider (6-8) as a model of the supply and demand for loans, where y₁ is quantity of loans, y₂ is interest rate, z₁ is balances at the Federal Reserve, and z₂ is balances at foreign banks. We are curious to know whether the two kinds of bank balances differ in their effects on the loan policy of a commercial bank. Is it possible to identify γ₁₁/γ₁₂? No, because (6-4) applied to this model yields
−γ₁₁ = π₁₁ + β₁₂π₂₁
−γ₁₂ = π₁₂ + β₁₂π₂₂
which cannot be solved for γ₁₁/γ₁₂ so long as β₁₂ is unidentified. The most we can get is the relation
(γ₁₁ + π₁₁)/π₂₁ = (γ₁₂ + π₁₂)/π₂₂
which is a straight line in the γ₁₁,γ₁₂ space, giving an infinity of pairs (γ₁₁, γ₁₂).
Exercises
6.B Derive explicitly the equations −Γ = BΠ for (6-8).
6.C In the above exercise, compute the two values of β₂₁ in terms of the coefficients of the reduced form. Under what arithmetical conditions would they be identical? Interpret this in economic terms.
6.11. Source of ambiguity in overidentified models
Let us return to (6-8), rewriting it for convenience
SS   q + β₁₂p + γ₁₁r + γ₁₂c = u₁   (6-9)
DD   β₂₁q + p = u₂
where q = quantity, p = price, r = rainfall, c = last year's price of a competing crop. Supply is underidentified, and demand is overidentified. For the latter we get from the reduced form two incompatible estimates of the single unknown β₂₁:
β̂′₂₁ = −π̂₂₁/π̂₁₁   β̂″₂₁ = −π̂₂₂/π̂₁₂
But why should the reduced form, if estimated by least squares, give two values for β₂₁, the price elasticity of demand? The answer is in terms of the wobblings of the supply function. In (6-9), supply wobbles in response to random shocks u₁ and to two unrelated exogenous variables, this year's rainfall r and last year's price c of a competing crop. In Fig. 10a I have drawn some supply curves corresponding to different amounts of rainfall (+1, −1) for a fixed value of c (= 0). Observable points fall in the parallelogram ABCD. On the other hand, in Fig. 10b the variations in supply come not from rainfall (which is held constant at 0) but from last year's price only. Observations fall in the parallelogram EFGH. The first estimate of β₂₁,
−β̂′₂₁ = π̂₂₁/π̂₁₁ = m_(p,c)·(r,c) / m_(q,c)·(r,c)   (6-10)
corresponds to the broken line in Fig. 10a, because π̂₂₁/π̂₁₁ correlates price and quantity reactions as they result from variations in rainfall only. The other estimate
−β̂″₂₁ = π̂₂₂/π̂₁₂ = m_(r,p)·(r,c) / m_(r,q)·(r,c)   (6-11)
corresponds to the broken line in Fig. 10b, because it correlates p to q as a result of variations in last year's price alone.¹ The sample must be very peculiar indeed that yields equal estimates β̂′₂₁ and β̂″₂₁.
¹ In expressions like (6-10) and (6-11), the heuristic device of canceling the "factors" c and r in numerator and denominator gives a correct interpretation of what is being correlated, provided that these "factors" appear on both sides of both dots.
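The ambiguity is easy to reproduce by simulation; all parameter values, the sample size, and the noise scales below are illustrative assumptions. We generate data from the supply-demand system, estimate the reduced form by least squares, and recover the demand slope both ways.

```python
import numpy as np

rng = np.random.default_rng(2)
beta12, gamma11, gamma12, beta21 = -0.5, -1.0, -0.8, 0.7   # made-up values
T = 200

r = rng.normal(size=T)                  # rainfall
c = rng.normal(size=T)                  # last year's price of a competing crop
u1 = rng.normal(scale=0.5, size=T)
u2 = rng.normal(scale=0.5, size=T)

# Solve the structural system for (q, p) each period:
#   q + beta12*p + gamma11*r + gamma12*c = u1
#   beta21*q + p                         = u2
B = np.array([[1.0, beta12], [beta21, 1.0]])
rhs = np.vstack([u1 - gamma11 * r - gamma12 * c, u2])
q, p = np.linalg.solve(B, rhs)

# Least squares on the reduced form: regress q and p on (r, c).
Z = np.column_stack([r, c])
pi_q, *_ = np.linalg.lstsq(Z, q, rcond=None)   # (pi11, pi12)
pi_p, *_ = np.linalg.lstsq(Z, p, rcond=None)   # (pi21, pi22)

b21_prime = -pi_p[0] / pi_q[0]    # slope traced by rainfall-induced variation
b21_second = -pi_p[1] / pi_q[1]   # slope traced by lagged-price variation
```

In any given sample the two estimates come out close to the true slope but not equal to each other, which is the overidentification ambiguity in miniature.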
The explanation, then, is at bottom simple: When demand is overidentified, this means that both rainfall r and lagged price c make the supply shift up and down, and trace the demand relationship for us. The original form of the model shows this. The reduced form, however, does not allow us to trace the demand uniquely, as the result of the combined effect of rainfall and lagged price. Rather, the reduced form gives us a choice of estimating the slope of the demand equation either as a result of rainfall-induced variations in supply or as a result of lagged-price-induced variations in supply. Essentially, then, either alternative leaves out some crucial consideration, namely, the fact that
Fig. 10. Ambiguity in an overidentified equation.
the omitted variable (lagged price and rainfall, respectively) also
affects the price and quantity combinations that the sample shows.
To show that (6-10) is a biased estimate, write p = u₂ − β₂₁q. Then m_(p,c)·(r,c) = m_(u₂,c)·(r,c) − β₂₁m_(q,c)·(r,c), and so
−β̂′₂₁ = −β₂₁ + m_(u₂,c)·(r,c) / m_(q,c)·(r,c)
The expected value of the bias term is not zero. This is easily seen from (6-9). Let r, c, and u₁ be fixed, and let u₂ take on a set of conjugate values +u₂ and −u₂ ≠ 0. Then, in (6-9), q necessarily takes on two different values q′ and q″, and thus the above denominator changes as u₂ takes on its conjugate values. Therefore, m_(+u₂,c)·(r,c)/m_(q′,c)·(r,c) and m_(−u₂,c)·(r,c)/m_(q″,c)·(r,c) do not add up to zero. To show that
β̂′₂₁ is a consistent estimate, consider that plim m_(u₂,c)·(r,c) = 0 but plim m_(q,c)·(r,c) ≠ 0.
Exercises
6.D If it turns out that β̂′₂₁ = β̂″₂₁, the sample moments must satisfy either the equation m_rr m_cc = m_rc m_rc or the equation m_pc m_qr = m_pr m_qc. The first of these declares that rainfall and last year's price are perfectly correlated in the sample. Interpret the second one. Hint: Use the fact that p = u₂ − β₂₁q.
6.E If least squares are applied to the reduced form, obtaining π̂s, prove the following: (1) that all parameters β̂, γ̂ that can be estimated by working back from the reduced form are consistent; (2) that all γ̂s that can be estimated (whether uniquely or ambiguously) are in general biased; (3) that all β̂s that can be estimated (whether uniquely or ambiguously) are in general biased.
In the following exercises, p (price) and q (quantity) are the endogenous variables. The exogenous variables are i (interest rate), f (liquid funds), and r (rainfall).
6.F Show that in
SS   q + β₁₂p = u   (exactly identified)
DD   β₂₁q + p + γ₂₁i = v   (underidentified)
only β₁₂ can be estimated unambiguously.
6.G From the model
SS   q + β₁₂p + γ₁₁r = u   (overidentified)
DD   β₂₁q + p + γ₂₁i + γ₂₂f = v   (exactly identified)
the reduced form leads to the following estimates of β₁₂:
β̂′₁₂ = −π̂₁₁/π̂₂₁   β̂″₁₂ = −π̂₁₂/π̂₂₂
where π̂₁₁ = m_(q,f,r)·(i,f,r)/m_(i,f,r)·(i,f,r) and π̂₂₁ = m_(p,f,r)·(i,f,r)/m_(i,f,r)·(i,f,r). Show that these estimates are biased and consistent.
6.H In Exercise 6.G, find the bias (if any) for γ̂₁₁, β̂₂₁, γ̂₂₂.
6.12. Identification and the parameter space
The likelihood function L(A) may or may not have a unique highest maximum as a function of the parameter estimates Â. If it does, the model is (exactly or over-) identified.
Fig. 11. Maxima of the likelihood function. a, underidentified; b, exactly identified; c, overidentified.
Along the axes, labeled α and β in Fig. 11, let me represent the parameter space. Usually this space has more dimensions, but I cannot picture these on flat paper.
Underidentification is pictured in Fig. 11a. Here the mountain has either a flat top T or a ridge RR′, or both. Its elevation is highest in many places rather than at a single place; i.e., there are many local maxima. This means that several values of α and β are candidates for the role of estimates of the true α and β. In the picture these candidates lie in the cobralike area PP′Q that creeps on the floor.
When the system is (exactly or over-) identified, nothing of this sort happens. The mountain has a single highest point. If the system is exactly identified, this fact is the end of the story, and Fig. 11b applies. When the system is overidentified, then Fig. 11c applies. The mountain in Fig. 11c is the same as in Fig. 11b, but we have several conflicting ways to look for the top. One estimating procedure allows us to look for the highest point of the mountain along, say, the 38th parallel; another equally admissible procedure tells us to look for it not along the 38th parallel but along the boundary XY between area I and area II. Accordingly, we get P′ and P″, two estimates of P that correspond to β̂′₂₁ and β̂″₂₁ of equations (6-10) and (6-11).
6.13. Over- and underidentification contrasted
The example in the figure suggests that overidentification and underidentification are not simple logical opposites, except in a very trivial sense, in relation to the Counting Rules. Table 6.1 gives the contrasts among over-, exact, and underidentification.
We say that underidentification is not usually a stochastic property, because it arises from the a priori specification of the model and not from sampling, and so it cannot be removed by better sampling. Stochastic underidentification is in the nature of a freak; it was illustrated in Sec. 6.8. On the other hand, overidentification is a stochastic property that arises because we disregard some information contained in the sample. Overidentification is removed if all the information of the sample is utilized, which means that reduced-form least-squares estimation must be abandoned.
Table 6.1

                                Underidentification | Exact identification | Overidentification

Unique maximum of the likelihood function:
    Does not exist | Exists | Exists

A priori restrictions for locating single highest point:
    Not enough | Enough | Too many

Ambiguity, if any, introduced because:
    You have not enough independent variation in supply and demand | No ambiguity | In reduced form you disregard one or another cause in the variation of supply

Estimate of the parameters if based on reduced form:
    β's: Biased, consistent | Biased, consistent | Biased, consistent
    γ's: Biased,* consistent | Biased,* consistent | Biased,* consistent

Is the degree of identification a stochastic property?
    Not usually; yes, if in fact a variable fails to vary | – | Yes

* In special cases, unbiased.
6.14. Confluence

Multicollinearity and underidentification are two special cases of a mathematical property called confluence.
Multicollinearity arises when you cannot separate the effects of two (or more) theoretically independent variables because in your sample they happen to have moved together. This topic is taken up again in Chap. 9.
To show the connection between underidentification and multicollinearity, I shall use a model adapted from Tintner, p. 33, which contains both.
Suppose that the world price of cotton is established by supply and demand conditions in the United States. Let the supply of American cotton q depend only on its world price p, while the American demand for cotton depends both on its price and on national income y.

DD:  q = αp + βy + u
SS:  q = γp + v          (6-12)
Fig. 12. Confluence.

Now, demand in this model is underidentified. If the sample comes from earlier years, when cotton was king, then the model, in addition, suffers from multicollinearity, because the national income was strongly correlated with the price and quantity of cotton. In the parameter space for α, β, and γ, the likelihood function L has a stationary value over a region of the space αβγ. To picture this (Fig. 12) let us forget γ, or assume that someone has disclosed it to be +0.03. The true value of the parameters is the point (α, β, 0.03). The ambiguous area, over which L has a flat top, is the band PQRS in Fig. 12. If a sample is taken from more recent years, multicollinearity is reduced, because national income y and the world price of cotton p are no longer so
strongly correlated as before. In Fig. 12, the gradual diversification of America's economy would appear as a gradual migration of the points in the band PQRS toward a narrower band around the curve MN. If the time comes when cotton becomes quite insignificant, then multicollinearity will have disappeared, but not the underidentification. In the figure, the band will have collapsed to the curve MN, but not to a single point.
Exercise

6.I Suggest methods for removing multicollinearity and discuss them.
Digression on the etymology of the term "multicollinearity"

Suppose, for purposes of illustration only, that in (6-12) the true values of the parameters are simply α = −1, β = 1, γ = 1. Also suppose that national income and the price of cotton are connected by the exact relation

y = 3p

(The exactness is for illustrative purposes only. What follows is also true for the stochastic case y = 3p + w.) Then the demand can be written

DD:  q = 2p + u

Now, the following estimates of α and β are consistent with all observations:
1. The true values α = −1, β = 1
2. The pair of values α = 1, β = 1/3, because q = 1p + (1/3)y + u = 2p + u
3. The pair of values α = 2, β = 0; and an infinity of other pairs, which can be represented as the collinear points of line AB in Fig. 13

If we take a bogus demand function, then its parameters α°, β° also form a collinear set of points, like line CD, that agrees with the sample.
Removing multicollinearity causes EF to collapse into M, AB into N, and CD into P; that is to say, removing multicollinearity collapses the band between the lines EF and CD into the line MNP. On the other hand, removing underidentification collapses the same band into the line AB.
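The point of the digression can be checked numerically. The sketch below uses the text's values α = −1, β = 1 and the exact relation y = 3p; the sample size and disturbances are invented for the illustration:

```python
import numpy as np

# The digression's point, using the text's values alpha = -1, beta = 1 and
# the exact relation y = 3p (sample size and disturbances invented).
rng = np.random.default_rng(1)
p = rng.normal(size=1000)
y = 3 * p                        # exact multicollinearity
u = rng.normal(size=1000)
q = -1 * p + 1 * y + u           # true demand: alpha = -1, beta = 1

# Any pair with alpha + 3*beta = 2 fits the sample exactly as well.
for alpha, beta in [(-1, 1), (1, 1 / 3), (2, 0)]:
    resid = q - (alpha * p + beta * y)
    assert np.allclose(resid, u)     # each pair leaves the same residual u
print("all three pairs fit identically")
```

The collinear set of admissible pairs is exactly the line α + 3β = 2, the line AB of the figure.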
Fig. 13. Multicollinearity.
Further readings

Koopmans, chap. 17, extends and refines the concept of a complete model. The excellent introduction to identification, also by Koopmans, was cited in Sec. 6.1. Klein gives scattered examples with discussion (consult his index).
It is worthwhile to read the seminal article of Elmer J. Working, "What Do Statistical 'Demand Curves' Show?" (Quarterly Journal of Economics, vol. 41, no. 2, pp. 212-235, February, 1927), both for its contents and in order to appreciate how far econometrics has progressed since that time.
Trygve Haavelmo's treatment of confluence, in "The Probability Approach in Econometrics" (Econometrica, vol. 12, Supplement, 1944), is hard going, but an excellent exercise in decoding. If you have come this far, you can tackle this piece.
Those who appreciate the refinement and proliferation of concepts and are not afraid of flights into abstraction may glance at Leonid Hurwicz, "Generalization of the Concept of Identification," chap. 4 of Koopmans.
Herbert Simon, "Causal Order and Identifiability," in Hood, chap. 3, shows how an econometric model can be analyzed into a hierarchy of submodels increasingly more endogenous and how the hierarchy accords with the statistical notion of causation and that of identifiability.
CHAPTER 7
Instrumental variables
7.1. Terminology and results

The term instrumental variable in econometrics has two entirely unrelated meanings:
1. A variable that can be manipulated at will by a policy maker as a tool or instrument of policy; for instance, taxes, the quantity of money, the rediscount rate
2. A variable, exogenous to the economy, significant, not entering the particular equation or equations we want to estimate, nevertheless used by us in a special way in estimating these equations

In this work only the second meaning is used.
This chapter explains and rationalizes the instrumental variable technique. It shows:
1. That the technique, though at first sight it appears odd, is logically similar to other estimating methods and also quite reasonable
2. That, if the choice of instrumental variables is unique, the model is exactly identified, and that the instrumental variable method is equivalent to applying least squares to the reduced form and then solving back to the original form
To understand the logic of instrumental variables we must first take
a deep look at parameter estimation in general.
7.2. The rationale of estimating parametric relationships
Two ideas dominate the strategy of inferring parametric connections
statistically. The first idea is that variables can be divided into
causes and effects. The second is that conflicting observations must
be weighted somehow.
Causes and effects
There can be one or more causes, symbolized by c, and one or more effects, symbolized by e. Various "instances" or "degrees" of c and e will carry a subscript. A parameter is nothing more than the change in effect(s), given a change in cause(s). Symbolically this can be represented as follows:

Parameter = change in effect(s) / corresponding change in cause(s) = Δe/Δc          (7-1)

This relation¹ is fundamental in Chaps. 7 and 8; the general theme of these chapters is that all techniques of estimation are variations and elaborations of (7-1).
The change in cause(s) and effect(s) can be any change whatsoever. For instance, the change in effect e may be e₁ − e₂, or e₂ − e₁, or e₁₅ − e₂₅; in general, eₜ − eₛ. The corresponding changes in cause c are c₁ − c₂, and c₂ − c₁, and c₁₅ − c₂₅; in general, cₜ − cₛ.
Usually, however, the change is computed from some fixed reference level of the effect(s) or the cause(s), and this fixed level is most typically the mean.² So parameters are usually computed by a formula like (7-2) rather than (7-1).

Parameter estimate = (eₜ − ē)/(cₜ − c̄) = π̂          (7-2)
¹ It is meant to be a conventional relationship. Only in the simplest linear systems is it true that the numerical value of a parameter can be expressed as simply as in equation (7-1).
² Ideally, the mean of the population. In practice, the mean of the sample. In linear models the distinction is immaterial for most purposes.
This is merely a convenience, which does not affect the logic of parameter estimation. Henceforth, all symbols Δe, Δc, eₜ, cₜ, or simply e and c represent deviations from the corresponding mean.
The problem of conflicting observations

What happens when two or more applications of (7-2) give different values for π? This is very likely to happen in stochastic models, because in such models the effect e results not merely from the explicit cause c but also from the disturbance u. Which of the several conflicting values should be assigned to the unknown parameter π? The problem arises in all but the simplest cases of parameter estimation.
In general, the parameter estimate is a weighted average of quotients of the form e/c. Take the model eₜ = γcₜ + uₜ. Then the weighted estimate of γ is

γ̂ = (e₁/c₁)w₁ + (e₂/c₂)w₂ + ⋯ + (eₛ/cₛ)wₛ

Any set of weights w₁, w₂, …, wₛ will do, provided they add up to 1. If you want to attach much significance to the instance e₂₇/c₂₇, make w₂₇ large; if you want to disregard the instance e₂₇/c₂₇, make w₂₇ zero or even negative.
Let us for simplicity restrict ourselves to just two observations. One of the many possible sets of weights is the following:

w₁ = c₁²/(c₁² + c₂²)          w₂ = c₂²/(c₁² + c₂²)          (7-3)

Then, using these weights, the weighted estimate is also the familiar least-squares estimate, because

(e₁/c₁)·c₁²/(c₁² + c₂²) + (e₂/c₂)·c₂²/(c₁² + c₂²) = (e₁c₁ + e₂c₂)/(c₁² + c₂²) = m_ec/m_cc

So the least-squares estimate amounts to nothing more than a special method for weighting conflicting values of the ratios e/c. Now two questions arise: (1) Why should the weights (7-3) be functions of the c's or have anything at all to do with the cause c? and (2) why
should we pick those particular formulas and not, for example, absolute values

w₁ = |c₁|/(|c₁| + |c₂|)          w₂ = |c₂|/(|c₁| + |c₂|)          (7-4)

or cubes, square roots, or logarithms?
The answer to question 1 is that the more strongly the cause departs from its average level, the more you weight it. It is as though we said that the real test of the relationship eₜ = γcₜ + uₜ is whether it stands up under atypical, nonaverage conditions (i.e., when cₜ is far from c̄). But now why should one [as formulas (7-3) and (7-4) suggest] give equal weight to +cₜ and −cₜ? The commonsense rationale here is that the same credence should be given to a ratio e/c when c is atypically small (relative to its mean) as when it is atypically large. This requirement is met by an evenly increasing function of c, for instance:
wₜ = cₜ²/Σc²,  or  |cₜ|³/Σ|c|³,  or  √|cₜ|/Σ√|c|,  or  (log cₜ²)/(Σ log c²)          (7-5)
The answer to question 2 is this: From the many alternative formulas in (7-5), cₜ²/Σc² is selected because of the assumption that u is normally distributed, in which case least squares approaches maximum likelihood. A different probability distribution of the disturbances would prescribe a different set of weights for averaging or reconciling conflicting values of e/c.
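The identity behind (7-3) is easy to verify numerically. The following sketch (data invented for the illustration) confirms that the weighted average of the ratios e_t/c_t, with weights c_t²/Σc², equals the least-squares estimate m_ec/m_cc:

```python
import numpy as np

# Check that the weighted average of e_t/c_t with weights (7-3) equals the
# least-squares estimate m_ec/m_cc. Data are invented for the illustration.
rng = np.random.default_rng(2)
c = rng.normal(size=50)
e = 0.7 * c + 0.1 * rng.normal(size=50)
c, e = c - c.mean(), e - e.mean()   # deviations from the mean, as in the text

w = c**2 / np.sum(c**2)             # weights (7-3); they add up to 1
gamma_weighted = np.sum((e / c) * w)
gamma_ls = np.sum(e * c) / np.sum(c * c)    # m_ec/m_cc

print(gamma_weighted, gamma_ls)     # identical, up to rounding
```

Notice that an atypically small c_t makes the ratio e_t/c_t wild, but its weight c_t² shrinks just as fast, which is why the average stays well behaved.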
7.3. A single instrumental variable

With these preliminaries well digested, we are ready to understand the logic of instrumental variables. Suppose that

pₜ = βqₜ + uₜ          (7-6)

is a model for the demand for sugar. This equation is known to have come from a larger system in which p (price) and q (quantity) are endogenous and a great many other causes are active. We call (7-6) the manifest model and all the remaining relationships the latent model. Suppose that z represents some exogenous cause affecting the economy, say, the tax on steel. Now in an interdependent economic system the tax on steel affects theoretically both the price of sugar and the quantity
of sugar bought, because a tax cannot leave unaffected either the price or the quantity of any substitute or competitive goods, and sugar surely must be somewhere at the end of the chain of substitutes or complements of steel.
The method of instrumental variables says that you ought to compute an estimate of β (from two observations t = 1, 2) as follows:

β̂ = (p₁/q₁)w₁ + (p₂/q₂)w₂          (7-7)

using as weights

w₁ = q₁z₁/(q₁z₁ + q₂z₂)          w₂ = q₂z₂/(q₁z₁ + q₂z₂)          (7-8)

so that the estimate of β is

β̂ = (p₁/q₁)·q₁z₁/(q₁z₁ + q₂z₂) + (p₂/q₂)·q₂z₂/(q₁z₁ + q₂z₂) = (p₁z₁ + p₂z₂)/(q₁z₁ + q₂z₂) = m_zp/m_zq          (7-9)
Every ounce of common sense in you ought to rear itself in rebellion at this perpetration. You ought to protest, saying: "Nonsense! My boss at the Zachary Sugar Refinery will fire me unceremoniously from my well-paid and respected job of Company Econometrician if I tell the Vice-President that I multiply the price and quantity of sugar with the tax on steel to estimate demand for sugar! Better give me a good argument for this act of alchemy. Moreover, did it ever occur to you that m_zq in the denominator could conceivably be equal to zero, and certainly is in some samples?¹ This predicament could not arise in least-squares weights like (7-3)."
I hasten to reply to the last point first. The possibility that m_zq might be zero is the reason why z should not be chosen haphazardly but rather from exogenous variables that have a lot to do with the quantity of sugar consumed. So, instead of the tax on steel, perhaps we ought to take the tax on coffee, honey, or sugar, or the quantity of school lunches financed by Congress. Still, it is possible that, in the sample chosen, the quantity of sugar q and the quantity of school lunches z happen to be uncorrelated, and it is true that this sort of difficulty is unheard of in least squares, because m_qq just cannot be zero: the quantity of sugar is perfectly correlated with itself.
¹ This sample would be a set of measure zero, as the mathematicians say.
Now what do the weights (7-9) say? They say that, the more sugar consumption and school lunches move together,¹ the more weight should be given to Δp/Δq, the price effect of the change in the quantity consumed. There is another way to look at this matter: Write, purely heuristically,

β̂ = Δp/Δq = (Δp/Δz)/(Δq/Δz)          (7-10)

where Δp/Δz is symbolized by γ₁ and Δq/Δz by γ₂. How could we possibly interpret γ₁ and γ₂?
Figure 14 represents both the latent and the manifest parts of the model (7-6).

Fig. 14. The logic of the instrumental variable.

The solid arrow represents the manifest mutual causation p ↔ q, which appears in (7-6). The broken arrows represent the latent model, which is not spelled out but which states that z affects p and q through the workings of the whole economic system. So the meaning of γ₁ and γ₂ is that they are the coefficients of the latent model.
Since z is exogenous, why not estimate γ₁ and γ₂ by least squares? It sounds highly reasonable. There should be no objection to this:

γ̂₁ = m_zp/m_zz          γ̂₂ = m_zq/m_zz          (7-11)

Insert (7-11) in (7-10), and then

β̂ = (Δp/Δz)/(Δq/Δz) = γ̂₁/γ̂₂ = (m_zp/m_zz)/(m_zq/m_zz) = m_zp/m_zq          (7-12)

which justifies the instrumental-variable formula (7-9).
¹ In a given instance, not overall.
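That (7-12) is an algebraic identity can be confirmed on any sample. A minimal sketch, with invented data:

```python
import numpy as np

# Check that the ratio of reduced-form slopes (7-11) reproduces the
# instrumental-variable estimate (7-9). Data invented for the illustration.
rng = np.random.default_rng(3)
z = rng.normal(size=100)
p = 0.5 * z + rng.normal(size=100)
q = -0.8 * z + rng.normal(size=100)
z, p, q = z - z.mean(), p - p.mean(), q - q.mean()

def m(a, b):                 # sample moment about the mean
    return np.sum(a * b)

gamma1_hat = m(z, p) / m(z, z)           # (7-11)
gamma2_hat = m(z, q) / m(z, z)
beta_iv = m(z, p) / m(z, q)              # (7-9)

print(gamma1_hat / gamma2_hat, beta_iv)  # identical, up to rounding
```

The m_zz factors cancel in the ratio, which is why the two routes always agree exactly.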
It is well to ask now: in what way do the two examples p = βq + u and e = γc + u differ? Why may the second model be computed by least squares although the first has to be estimated by instrumental variables? The reason is that the second, c → e, is a complete model in itself: causation flows from c to e, and nothing else is involved. On the other hand, p = βq + u hides a lot of complexity, i.e., causation from z to p and from z to q. These causations are unidirectional, z → p and z → q, and can be treated like c → e; but p ↔ q is not of the same kind, and must be treated by taking into account the hidden part.
7.4. Connection with the reduced form

The instrumental variable technique is very intimately connected with the method of applying least squares to the reduced form. Assume for the moment that z is the only exogenous variable affecting the economy and that the complete model is

p − βq = u
(1/γ₁)p + (1/γ₂)q − z = v          (7-13)

whose first equation is manifest (the solid arrows in Fig. 14). The second equation is latent and corresponds to the broken arrows.

Exercise
7.A Explain why the complete model cannot contain three independent equations, one for p ↔ q, one for z → p, and one for z → q.

The reduced form is

Dp = βz + (1/γ₂)u + βv
Dq = z − (1/γ₁)u + v          (7-14)

where D is the determinant 1/γ₂ + β/γ₁.
If the second equation in (7-13) contained another exogenous variable, say, z′, then the first equation would be overidentified. This fact would be reflected in the instrumental variable technique as the
following dilemma: Should we use z or z′ as the instrumental variable? On this last point I have more to say in Secs. 7.5 to 7.7.
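The reduced form (7-14) can be verified mechanically: solve the structural system (7-13) for p and q and compare. A sketch with invented parameter values:

```python
import numpy as np

# Check of the reduced form (7-14): solve the structural system (7-13) for
# p and q and compare. Parameter values are invented for the illustration.
rng = np.random.default_rng(4)
beta, g1, g2 = 1.3, 0.5, -0.7
z, u, v = rng.normal(size=3)

# (7-13):  p - beta*q = u;   (1/g1)*p + (1/g2)*q - z = v
A = np.array([[1.0, -beta],
              [1.0 / g1, 1.0 / g2]])
p, q = np.linalg.solve(A, [u, z + v])

D = 1.0 / g2 + beta / g1                 # the determinant of (7-13)
assert np.isclose(D * p, beta * z + u / g2 + beta * v)
assert np.isclose(D * q, z - u / g1 + v)
print("reduced form (7-14) confirmed")
```

The same determinant D that scales the reduced form is the coefficient matrix's determinant, so the system is solvable exactly when D is not zero.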
7.5. Properties of the instrumental variable technique in the simplest case

The technique is biased and consistent, where naive least squares is biased and inconsistent.

β̂ − β = m_zp/m_zq − β = m_zu / [(1/D)(m_zz − (1/γ₁)m_zu + m_zv)]          (7-15)

For naive least squares,

β̂ − β = m_pq/m_qq − β = [m_zu − (1/γ₁)m_uu + m_vu] / [(1/D)(m_zz − (2/γ₁)m_zu + 2m_zv + (1/γ₁²)m_uu − (2/γ₁)m_uv + m_vv)]          (7-16)

Under the ordinary Simplifying Assumptions, σ_zu = σ_zv = σ_uv = 0; so, for large samples, the first expression approaches 0, and the second approaches −(1/γ₁)σ_uu / [(1/D)(σ_zz + (1/γ₁²)σ_uu + σ_vv)], which is not zero.
Expression (7-15) gives us additional guidance for selecting an instrumental variable. To minimize bias, the following conditions should be fulfilled, either singly or in combination:
1. m_zu numerically small
2. m_zz numerically large
3. D numerically large

The first condition says that z should be truly exogenous to the sugar market. Appropriations for school lunches are better in this respect than the tax on sugar, because the tax might have been imposed to discourage consumption or to maximize revenue, in which case it would have some connection with the parameters and variables of the sugar market.
The second condition says that, in the sample, the instrumental variable must have varied a lot: if the tax on sugar varied only trivially it had no opportunity to affect p and q significantly enough for us to
capture β by our estimating procedure. From this point of view the tax on sugar might be less desirable as an instrumental variable than some remote but more volatile entity, say, the budget's appropriations for the U.S. Information Service.
The third condition says that D = 1/γ₂ + β/γ₁ should be numerically large; that is, that γ₁ and γ₂ should be numerically small relative to β. This says that, to minimize bias, p and q should react more strongly to each other (in the manifest model) than to the instrumental variable in the latent model. It requires that price and quantity be more sensitive to each other in the sugar market than to such things as the U.S.I.S. budget, the tax on honey, or, for that matter, the tax on sugar itself.
It is not easy to find an instrumental variable fulfilling all conditions at once. However, if the sample is large, any sort of instrumental variable gives better estimates than least squares.
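The contrast between the two techniques shows up clearly in a simulation of model (7-13). The sketch below (all parameter values invented) draws a large sample from the reduced form (7-14) and compares naive least squares with the instrumental-variable estimate (7-9):

```python
import numpy as np

# Model (7-13) simulated: naive least squares vs. the instrumental-variable
# estimate (7-9). Parameter values are invented for the illustration.
rng = np.random.default_rng(5)
beta, g1, g2 = 1.3, 0.5, 0.7
n = 100_000

z, u, v = rng.normal(size=(3, n))

# reduced form (7-14)
D = 1.0 / g2 + beta / g1
p = (beta * z + u / g2 + beta * v) / D
q = (z - u / g1 + v) / D

beta_ls = np.sum(p * q) / np.sum(q * q)   # naive least squares
beta_iv = np.sum(z * p) / np.sum(z * q)   # instrumental variable

print(beta_ls, beta_iv)   # beta_iv is near 1.3; beta_ls is far off
```

The naive estimate stays far from β no matter how large the sample, because q carries the disturbance u; the instrumental-variable estimate converges to β because z does not.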
7.6. Extensions

The instrumental variable technique can be extended in several directions:
1. The single-equation incomplete manifest model may contain several parameters to be estimated. For example, the model p = β₁q + β₂y + u requires two instrumental variables z₁ and z₂. The estimating formulas are analogous to (7-9):

β̂₁ = [m(z₁,p)m(z₂,y) − m(z₂,p)m(z₁,y)] / [m(z₁,q)m(z₂,y) − m(z₂,q)m(z₁,y)]
β̂₂ = [m(z₂,p)m(z₁,q) − m(z₁,p)m(z₂,q)] / [m(z₂,y)m(z₁,q) − m(z₁,y)m(z₂,q)]          (7-17)

All the criteria of Sec. 7.5 for selecting a good instrumental variable are still valid, plus the following: z₁ and z₂ must really be different variables, that is, variables not well correlated in the sample; else the denominators approach zero, and the estimates β̂₁, β̂₂ blow up.
Exercise

7.B If we wish to estimate the parameters of p = βq + γz + u, where z is exogenous, is it permissible for z itself to be one of the instrumental variables z₁ and z₂?
2. The incomplete manifest model may consist of several equations, for instance:

q + β₁₂p + γ₁z = u₁
β₂₁q + p + γ₂z = u₂

Each equation can be estimated independently of the other, using formulas analogous to (7-17). Variable z itself and another variable z₁ may be used as instrumental variables in both equations, or two variables z₁, z₂ completely extraneous to the manifest model may be used.
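For extension 1, the estimating formulas (7-17) amount to solving the two moment conditions m(z₁,p) = β₁m(z₁,q) + β₂m(z₁,y) and m(z₂,p) = β₁m(z₂,q) + β₂m(z₂,y) as a 2 × 2 linear system. A sketch with an invented data-generating process:

```python
import numpy as np

# Two parameters, two instruments: solve the moment conditions as a
# 2x2 linear system. The data-generating process is invented.
rng = np.random.default_rng(6)
b1, b2, n = 1.5, -0.4, 100_000

z1, z2, e1, e2, u = rng.normal(size=(5, n))
q = z1 + 0.5 * z2 + e1
y = 0.5 * z1 - z2 + e2
p = b1 * q + b2 * y + u + 0.8 * e1    # e1 makes q endogenous in this sketch

def m(a, b):
    return np.mean(a * b)

M = np.array([[m(z1, q), m(z1, y)],
              [m(z2, q), m(z2, y)]])
rhs = np.array([m(z1, p), m(z2, p)])
b1_hat, b2_hat = np.linalg.solve(M, rhs)
print(b1_hat, b2_hat)   # near 1.5 and -0.4
```

The determinant of the matrix M is the common denominator of (7-17); if z₁ and z₂ are well correlated in the sample, that determinant is near zero and the estimates blow up, as the text warns.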
7.7. How to select instrumental variables

In some instances we may have several candidates for the role of instrumental variable. The choice is made anew for each equation of the manifest model, and the rules are:
1. If several instrumental variables are needed, they should be those least correlated with one another.
2. The instrumental variables should affect strongly as many as possible of the variables present in the equation that is being estimated.

Choosing instrumental variables is admittedly arbitrary. Another statistician with the same data might make a different choice and so get different results for the same model. The technique of weighting instrumental variables eliminates some of this arbitrariness. I illustrate the technique for the single-equation, single-parameter demand model p = βq + u. Suppose that two exogenous variables are available, z₁ (the sugar tax) and z₂ (the tax on honey), and that both affect p and q. To select z₁ or z₂ is arbitrary. The new variable z = w₁z₁ + w₂z₂, a linear combination of the two taxes with arbitrary weights w₁ and w₂, is less arbitrary because both taxes are taken into account.¹
Results improve considerably if we take w₁, w₂ proportional to the importance of the two taxes on the sugar market. Naturally, to estimate the parameters of the sugar market, the weight given the sugar tax should be greater than that given to the tax on honey; and vice versa when we want to study the honey market. In general, we ought to rank the instrumental variable candidates z₁, z₂, z₃, … in order of increasing remoteness from the sector being estimated and
¹ This treatment with w₁ = w₂ = 1 coincides with Theil's method with k = 1. Consult Chap. 9 below.
assign them decreasing weights in a new instrumental variable z = w₁z₁ + w₂z₂ + w₃z₃ + ⋯. The more accurate the a priori information by means of which weights are assigned, the more does this technique approximate the results of the full information maximum likelihood method, discussed in Chap. 8.
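The mechanics of the composite instrument z = w₁z₁ + w₂z₂ can be sketched as follows; the weights and the data-generating process are invented for the illustration, and each choice of weights gives a consistent estimate, though the sampling variances differ:

```python
import numpy as np

# Sketch of the weighted instrument z = w1*z1 + w2*z2 for p = beta*q + u.
# Weights and data-generating process are invented for the illustration.
rng = np.random.default_rng(7)
beta, n = 1.2, 100_000

z1, z2, u, v = rng.normal(size=(4, n))
q = 1.0 * z1 + 0.2 * z2 + v      # z1 matters far more for this market
p = beta * q + u

for w1, w2 in [(1.0, 0.0), (0.0, 1.0), (5.0 / 6.0, 1.0 / 6.0)]:
    z = w1 * z1 + w2 * z2
    beta_hat = np.sum(z * p) / np.sum(z * q)
    print((w1, w2), round(beta_hat, 3))
```

Weighting the more important tax more heavily gives the composite instrument a stronger hold on q, which is the intuition behind Exercise 7.C.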
Exercises

Warning: These are difficult!
7.C Prove or disprove the conjecture that weighted instrumental variables are "better" than unweighted. Use the model p = βq + u, (1/γ₁)p + (1/γ₂)q − δ₁z₁ − δ₂z₂ = v, where γ₁, γ₂, δ₁, δ₂ measure the sensitivity of p and q to z₁ and z₂. Define w₁ and w₂ as δ₁/(δ₁ + δ₂) and δ₂/(δ₁ + δ₂). Define

β̂(z₁) = m_z₁p/m_z₁q          β̂(z₂) = m_z₂p/m_z₂q          β̂(w) = m_(w₁z₁+w₂z₂)p/m_(w₁z₁+w₂z₂)q

Prove E{β̂(z₁) − E[β̂(z₁)]}² > E{β̂(w) − E[β̂(w)]}² < E{β̂(z₂) − E[β̂(z₂)]}².
7.D Prove or disprove the conjecture that the goodness of the weighted instrumental variable technique is insensitive to small departures of w₁, w₂ from their ideal relative sizes δ₁/(δ₁ + δ₂), δ₂/(δ₁ + δ₂).
CHAPTER 8

Limited information

8.1. Introduction
Limited information maximum likelihood is one of the many techniques available for estimating an identified (exactly or overidentified) equation. Other methods are (1) naive least squares, (2) least squares applied to the reduced form, (3) instrumental variables, (4) weighted instrumental variables, (5) Theil's method,¹ and (6) full information.
Method 1 is biased and inconsistent; the rest are biased and consistent. They are listed in order of increasing efficiency. Limited information leads to burdensome computations, but is less cumbersome than full information. Unlike full information but like all other methods, limited information can be used on one equation of a model at a time. Limited information differs from the method of instrumental variables in two ways: it makes use of all, not an arbitrary selection, of the exogenous variables affecting the system; it prescribes a special way of using the exogenous variables. If an equation is exactly identified, limited information and instrumental variables are equivalent methods. Like all methods of estimating parameters, limited information uses formulas that are nothing more than a
¹ Discussed in Chap. 9.
glorified version of the quotient

change in effect(s) / corresponding change in cause(s)

I shall illustrate by the example of (8-1), where the first equation is to be estimated by the limited information method. The rest of the model may be either latent or manifest. The limited information method ignores part of what goes on in the remaining equations by deliberate choice, not because they are latent (though, of course, they might be). However, in (8-1) the entire model is spelled out for pedagogic reasons. The minus signs are contrary to the notational conventions used so far but are very handy in solving for y₁, y₂, y₃. Nothing in the logic of the situation is changed by expressing B and Γ in negative terms.

y₁ − βy₂ − γz₁ = u₁
y₂ − β₂₃y₃ − γ₂₂z₂ − γ₂₃z₃ − γ₂₄z₄ = u₂          (8-1)
−β₃₁y₁ + y₃ − γ₃₁z₁ − γ₃₄z₄ = u₃

As usual, a single asterisk distinguishes the variables admitted in the first equation and a double asterisk those excluded from it. Thus y* = vec (y₁, y₂), z* = vec (z₁), y** = vec (y₃), z** = vec (z₂, z₃, z₄). We apply the limited information method in two cases:
1. When nothing more is known about the economy than that somehow z₂, z₃, z₄ affect it
2. When more is known but this is purposely ignored
8.2. The chain of causation

Let the first equation be the one we wish to estimate, out of a model containing several. The chains of causation in a general model of several equations in several unknowns are shown in Fig. 15. The arrowheads show that causation flows from the z's to the y's but not back, and mutually between the y's. Solid arrows correspond to the first equation, broken arrows to the rest of the model.

Fig. 15. Causation in econometric models.
Fig. 16. Chain of causation in the special model (8-1).

The two left-hand rounded arrows, one solid, one broken, show that the y*'s (the endogenous variables admitted in the manifest model) interact both in the first equation and (possibly, too) in the rest of the model. The right-hand rounded arrow shows that the y**'s (the endogenous variables excluded) interact, but, naturally, only in the rest of the model. Parenthetically, the crinkly arrows symbolize intercorrelation among the exogenous variables. Ideally, the exogenous variables are unconnected, but in any given sample they may happen to be intercorrelated. This is the familiar problem of multicollinearity (Sec. 6.14) in its
general form. The stronger the correlation between one z and another, the less reliable are estimates of the γ's, because different exogenous variables have acted alike in the sample period. We shall ignore multicollinearity and continue with the main subject.
Figure 16 shows the chains of causation in (8-1). The variable z₁ affects y₁ and y₂ in the first equation and y₁ and y₃ in the third. Variables z₂, z₃, z₄ affect y₂ and y₃ in the second, and z₄ affects y₁ and y₃ in the third. We can make the arrows between the y's single rather than double-headed because there are as many equations as there are endogenous variables. Thus the model can be put into cyclical form.¹ It so happens that (8-1) is already in this form; that is, given a constellation of values for the exogenous variables and the random disturbances, if we give y₃ an arbitrary value, then y₃ determines y₂, which in turn determines y₁, which in turn affects y₃, and so round and round until mutually compatible values are reached.
8.3. The rationale of limited information

The problem of estimating the model of Fig. 16 can be likened to the following problem. Suppose that z₁, z₂, z₃, z₄ are the locations of four springs of water and that y₁, y₂, y₃ are, respectively, the kitchen tap, bathtub tap, and shower tap of a given house. The arrows are pipes or presumed pipes. Estimating the first equation is like trying to find the width of the pipes between z₁ and y₁ and y₂ and of the pipe between y₂ and y₁. The width is estimated by varying the flows at the four springs z₁, z₂, z₃, z₄ and then measuring the resulting flow in the kitchen (y₁), bathtub (y₂), and shower (y₃). Limited information attempts to solve the same problem with the following handicaps arising either from lack of knowledge or from deliberate neglect of knowledge:
1. Pipes are known to exist for certain only where there are solid arrows (γ, β).
2. It is known that z₂, z₃, z₄ enter the flow somewhere or other, but it is not known where.
3. It is not known whether there is another direct pipeline (γ's) from z₁ to the kitchen (y₁) and bathtub (y₂).
4. The flow at the shower (y₃) is ignored even if it is measurable.
¹ Note carefully that the cyclical and the recursive are different forms.
So as not to fill up page upon page with dull arithmetic, I am going to cut model (8-1) drastically by some special assumptions, which, I vouch, remove nothing essential from the problem. The special assumptions are: β = 1.6, γ = 0.5, γ₃₁ = γ₃₄ = 0, β₃₁ = 0.1, β₂₃ = 2; γ₂₂z₂ + γ₂₃z₃ + γ₂₄z₄ is combined into one simple term γ₂z**; γ₂ = 0.5. Then (8-1) collapses to

y₁ − βy₂ − γz* = u₁
y₂ − β₂₃y₃ − γ₂z** = u₂          (8-2)
−β₃₁y₁ + y₃ = u₃
and Fig. 16 collapses to Fig. 17. Now let us change metaphors.

Fig. 17. Another special case.

Instead of a hydraulic system, think of a telephone network. The coefficients β, γ, if greater than 1, represent loudspeakers; if less than 1, lowspeakers. Where a coefficient is equal to 1, sound is transmitted exactly. To avoid having to reconcile conflicting observations, assume that all the disturbances are zero, i.e., that there is neither leakage of sound out of nor noise into the acoustic system of Fig. 17.
Here is how the estimating procedure works. Begin from a state
of acoustical equilibrium, and measure the noise level at each point of
the network. Then step up the sound level at z** by 100 units.
Only 50 of these reach location 2/2, because there is a twofold lowspeaker
(72 = 0.5) between z** and 2/2. Also step up the sound level at z* by,
say, 10 units. Only 5 units (7 = 0.5) get to 2/1. But, whatever extra
noise there is at y1, one-tenth of it (β31 = 0.1) reaches y3. From y3 a
loudspeaker doubles the increment as it conveys it to y2, whence some
gets to y1, and so on. By differencing (82) and solving for Δu = 0,
Δz* = 10, and Δz** = 100, the ultimate increments are found to be
Δy1 = 125, Δy2 = 75, Δy3 = 12.5. Now, suppose we did not know how
strong was the low-speaker connection β between y1 and y2. By
differencing (82), we get

    β̂ = (Δy1 − γ Δz*)/Δy2      (83)

When the model is exact, it takes exactly five observations to determine
β, γ, γ2, β23, β31. When the model is stochastic, there are complications,
but the basic appearance of the formula is not much different.
The numerator can be interpreted as that change in the sound level y1
not attributable to what is coming over the line from z*, that is to say,
only the sound that comes from y2 and y3. The denominator measures
the increment at y2 resulting from two sources, z** and y3. The limited
information method just ignores the latter source entirely. This is so
because both β23 and β31 belong to the "rest of the model" and are
neither specified nor evaluated. So, (83) is interpreted as follows:

    β̂ = (variation in y1 not due to any z*)/(variation in y2 from all sources)      (84)
The limited information method suppresses β23 and β31 and estimates
β by

    β̂ = (Δy1 − γ Δz*)/(γ2 Δz**) = (variation in y1 not due to z*)/(variation in y2 due to z**)      (85)

Notice carefully that the method suppresses only β23 and β31, that is, the
latent model's intervariation of the endogenous variables. It does not
suppress γ2, i.e., the variation (in the latent model) due to the exogenous
variables z**.
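The increments quoted above can be checked mechanically. The short sketch below (variable names are mine) solves the differenced system (82) with Δu = 0 and then applies (83):

```python
import numpy as np

# Coefficients of the special case (82), as given in the text.
beta, gamma = 1.6, 0.5      # y2 -> y1 and z* -> y1
beta23, gamma2 = 2.0, 0.5   # y3 -> y2 and z** -> y2
beta31 = 0.1                # y1 -> y3

# Differenced system with all disturbances zero:  B @ dy = G @ dz.
B = np.array([[1.0,    -beta,  0.0],
              [0.0,     1.0,  -beta23],
              [-beta31, 0.0,   1.0]])
G = np.array([[gamma, 0.0],
              [0.0,   gamma2],
              [0.0,   0.0]])
dz = np.array([10.0, 100.0])       # step z* up by 10 and z** up by 100
dy = np.linalg.solve(B, G @ dz)
print(dy)                          # approximately [125., 75., 12.5], as in the text

# Estimate of beta from (83): sound at y1 not due to z*, per unit of sound at y2.
beta_hat = (dy[0] - gamma * dz[0]) / dy[1]
print(beta_hat)                    # approximately 1.6
```

The same solve, repeated for different steps in z* and z**, generates the five observations needed to pin down all five parameters of the exact model.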
8.4. Formulas for limited information
This section shows that the lengthy formulas for computing limited
information estimates are just fancy versions of (83). It can safely
be skipped, for it contains no new ideas. To obtain estimates of the
βs of the first equation, combine the moments of the variables as in
the following list:
In general, with the model (81) specialization in brackets:
1. Construct
   C = m_{y*z} (m_{zz})^{-1} m_{zy*}   [y* = (y1, y2); z = (z1, z2, z3, z4)]
2. Construct
   D = m_{y*z*} (m_{z*z*})^{-1} m_{z*y*}   [z* = z1]
3. Construct
   W = m_{y*y*} − C
4. Compute¹
   V = C − D
5. Compute²
   Q = V^{-1} W
6. The estimate of the βs of the first equation is a nontrivial solution
(called the eigenvector) of Q.
7. Having computed the β̂s, one can calculate the γ̂s, the ûs, and estimates of the
covariances of the disturbances and the parameter estimates.
In steps 1 and 2 above, the factors m_{zz} and m_{z*z*} and also the z and
z* in the remaining moment matrices play a role analogous to the
weights c_t/Σc_t² in the least squares technique.³ They just provide a
method for reconciling the conflicting observations generated by the
nonzero random disturbances.
The matrix m_{y*y*} corresponds to the pair of round arrows about y*
in Fig. 15.
Essentially, Q yields the estimate of the βs of the first equation. Q can be
interpreted as a quotient, because the matrix operation V^{-1}W reminds
one of the ratio of two numbers: W/V. Actually, this impressionistic
intuition is quite correct. W corresponds to an elaborate case of Δe
¹ Klein calls this B instead of V. I use V to avoid confusion with the B of
By + Γz = u.
² Klein calls this A. I use Q to avoid confusion with the A of the model
Ax = u.
³ Compare with Sec. 7.2.
(the change in the effects), and V is an elaborate case of Δc (the
corresponding change in the causes). Indeed W and V are complicated
cases of the numerator and denominator of (85). W is interpreted as
the variation of the endogenous variables not due to any exogenous
changes, and V expresses the variation of the endogenous variables
from all sources exogenous to any equation of the model and endogenous
to the manifest part.
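A compact numerical sketch of steps 1 to 6 follows, on an illustrative system of my own (a first equation y1 = 0.8 y2 + 0.5 z1 + u1 with one included and two excluded exogenous variables, not a model from the text). Two assumptions of the sketch: V is taken as C − D, and the eigenvector is taken at the root dictated by the least-variance-ratio principle.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Illustrative system: y1 = 0.8*y2 + 0.5*z1 + u1, completed by
# y2 = 0.4*z2 + 0.6*z3 + u2, with u1 and u2 correlated.
z = rng.normal(size=(n, 3))
e = rng.normal(size=(n, 2))
u1, u2 = e[:, 0] + 0.6 * e[:, 1], e[:, 1]
y2 = 0.4 * z[:, 1] + 0.6 * z[:, 2] + u2
y1 = 0.8 * y2 + 0.5 * z[:, 0] + u1

Ystar = np.column_stack([y1, y2])   # endogenous variables present in the first equation
Z_all, Z_star = z, z[:, :1]         # all exogenous variables vs. the present one (z1)

def explained(Y, Z):
    """Moment of Y explained by Z:  m_{Yz} (m_{zz})^{-1} m_{zY}."""
    return Y.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y)

M = Ystar.T @ Ystar
C = explained(Ystar, Z_all)         # step 1
D = explained(Ystar, Z_star)        # step 2
W = M - C                           # step 3: variation not due to any z
V = C - D                           # step 4: variation due to the excluded z** (assumed reading)
Q = np.linalg.solve(V, W)           # step 5

# Step 6: the eigenvector of Q at the least-variance-ratio root carries the betas.
vals, vecs = np.linalg.eig(Q)
b = vecs[:, np.argmax(vals.real)].real
b = b / b[0]                        # normalize the coefficient of y1 to one
print(-b[1])                        # limited information estimate of beta, near 0.8
```

With 2,000 observations the estimate lands close to the true 0.8 even though the disturbances of the two equations are correlated, which is exactly where naive least squares would go astray.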
8.5. Connection with the instrumental variable method
Limited information recognizes that exogenous influences not present
in the first equation influence the course of events. The instrumental
variable method acknowledges the same thing. Limited information
makes use of all these exogenous influences, whereas the instrumental
variable method (generally) picks from among them, either haphazardly
or according to the principles of Sec. 7.7.
When the first equation is exactly identified, picking is impossible
and the two methods coincide.
8.6. Connection with indirect least squares
The limited information method can also be interpreted as a form of
modified indirect least squares or as a generalization of directional
least squares (see the Digression in Sec. 4.4). The direct or naive
least squares method estimates β essentially as the regression coefficient
of y1 on y2. Haavelmo's proposition (Chap. 4) advised us to minimize
square residuals in the northeast-southwest direction in order to allow
for autonomous variations in the exogenous variable, investment z_t.
In (81) there are several such exogenous variables z1, z2, z3, z4, which
generate in the y1y2 plane a scatter diagram which is a weighted
average of lozenge-shaped figures (as in Fig. 9), one for z1, one for z2,
and so on. In matrix C (and, hence, in W and V) this weighted
averaging has taken place.
Further readings
Hood, chap. 10, describes in detail how to compute limited information
and other types of estimates, and illustrates with a completely worked out
macroeconomic model of Klein's.
CHAPTER 9
The family of simultaneous
estimating techniques
9.1. Introduction
We owe to Theil¹ a theorem showing that all the estimating techniques
of Chaps. 4 to 8 are special cases of a new technique, which has
the further merit of being fairly easy to compute. Section 9.2, which
covers this ground, is addressed primarily to lovers of mathematical
generality and elegance; other readers might skip or skim.
The other sections of this chapter reconsider underidentification and
overidentification from the point of view of research strategy. Section
9.3 accepts models as given (over, under, or exactly identified) and
suggests alternative treatments. Section 9.4 raises the issue of
whether econometric models can be anything but underidentified.
9.2. Theil's method of dual reduced forms
This method can be applied to all equations of a system, one at a
time. The equation we want to estimate, called the "first" equation,
1 Reference in Further Readings at the end of this chapter.
comes from a complete system, for instance, (81). We know and can
observe all the exogenous variables affecting the system, and we also
know a priori which variables (endogenous and exogenous) enter the
first equation. The other equations may be identified or not. The
disturbances have the usual Simplifying Properties. Any endogenous
variable of the first equation can be chosen to play the role of dependent
variable. We shall use y1 in this role. The remaining variables of the
first equation, namely, y2, . . . , yG*, z1, . . . , zH*, must all be different
in the sample; that is to say, they must not behave as if they were
linear combinations of one another. We do not need to know or
observe the endogenous variables yG*+1, . . . , yG not present in the
first equation.
Let one star, as usual, represent presence in the first equation, and
two stars, absence from the first equation.
We then form two reduced forms whose coefficients we calculate by
simple least squares: (1) y* on z* with parameters θ and residuals v;
and (2) y* on z = (z*, z**) with parameters p and residuals w. For
instance, to estimate the first equation of (81), compute

    y1 = θ11 z1 + v1      y1 = p11 z1 + p12 z2 + p13 z3 + p14 z4 + w1
    y2 = θ21 z1 + v2      y2 = p21 z1 + p22 z2 + p23 z3 + p24 z4 + w2      (91)

The right-hand set in (91) is necessary for estimating the first and
useful for estimating the other equations of (81). Let us omit the
bird (ˆ) where it is obvious.
Next, we compute the moments of the residuals on one another and
construct two new matrices D(k) and N(k):

    D(k) = m_{(y2,…,yG*, z1,…,zH*)(y2,…,yG*, z1,…,zH*)} − k m_{(w2,…,wG*, 0,…,0)(w2,…,wG*, 0,…,0)}
    N(k) = m_{(y2,…,yG*, z1,…,zH*) y1} − k m_{(w2,…,wG*, 0,…,0) w1}

where k is a variable that will be defined below. Then the estimates
of the βs and γs of the first equation are given by

    est (β2, . . . , βG*, γ1, . . . , γH*) = [D(k)]^{-1} N(k)      (92)

Theil has proved that, if k = 0, then (92) gives the naive least
squares estimate with y1 treated as the sole dependent variable. If
k = 1, (92) gives the method of unweighted instrumental variables
of Sec. 7.7. If k = 1 + ν, where ν is the smallest root of

    det [m_{(v1,…,vG*)(v1,…,vG*)} − (1 + ν) m_{(w1,…,wG*)(w1,…,wG*)}] = 0      (93)

then the estimates of (92) are identical with the limited information
estimates of Chap. 8. All these estimates except for the k = 0 case
are consistent, but biased in the βs. In the case k = 1, the bias itself
can be estimated and corrected for.¹
These findings not only are exciting for their beauty and symmetry,
but are practical as well. The regressions (91) are straightforward
and attainable by simple calculation (see Appendix B) even for large
systems. The solution of (93) is not too hard, since the number G*
of present endogenous variables seldom exceeds 3 or 4 in any actual
models. But (93) must be calculated over again if we decide to
estimate the second or third equation of the original model. Theil
states that his technique works if the remaining equations of the
system are nonlinear and that it works for large samples even when
some of the z's are lagged values of some y.
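The dual-reduced-form recipe can be tried on a small simulated system (my own numbers, not a model from the text): k = 0 reproduces naive least squares, k = 1 the unweighted instrumental-variable estimate, and the smallest root ν of (93) gives the limited information value k = 1 + ν.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Illustrative model: first equation y1 = 0.8*y2 + 0.5*z1 + u1,
# with y2 = 0.4*z2 + 0.6*z3 + u2 and correlated disturbances.
z = rng.normal(size=(n, 3))
e = rng.normal(size=(n, 2))
u1, u2 = e[:, 0] + 0.6 * e[:, 1], e[:, 1]
y2 = 0.4 * z[:, 1] + 0.6 * z[:, 2] + u2
y1 = 0.8 * y2 + 0.5 * z[:, 0] + u1

def resid(y, X):
    """Least squares residual of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Residuals of the two reduced forms: v from y* on z* (z1 only), w from y* on all z.
v1, v2 = resid(y1, z[:, :1]), resid(y2, z[:, :1])
w1, w2 = resid(y1, z), resid(y2, z)

# D(k) and N(k) of (92): moments of (y2, z1), less k times moments of (w2, 0).
X = np.column_stack([y2, z[:, 0]])
Wm = np.column_stack([w2, np.zeros(n)])

def k_class(k):
    D = X.T @ X - k * (Wm.T @ Wm)
    N = X.T @ y1 - k * (Wm.T @ w1)
    return np.linalg.solve(D, N)   # estimates of (beta, gamma1)

# Smallest root nu of (93): det[m_vv - (1 + nu) m_ww] = 0.
Mv = np.column_stack([v1, v2])
Mw = np.column_stack([w1, w2])
roots = np.linalg.eigvals(np.linalg.solve(Mw.T @ Mw, Mv.T @ Mv)).real
nu = roots.min() - 1.0

print(k_class(0.0))        # naive least squares: beta is badly biased here
print(k_class(1.0))        # unweighted instrumental variables: near (0.8, 0.5)
print(k_class(1.0 + nu))   # limited information estimate
```

Because the same residual matrices serve every value of k, switching among the rival estimators costs almost nothing once the two reduced-form regressions are in hand, which is the practical appeal claimed for the method.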
9.3. Treatment of models that are not exactly identified
This section gives advice on how to treat models that in their
natural state contain some underidentified or some overidentified
equations, or both. The alternatives are listed from the most desirable
to the least desirable, disregarding the cost of computation.
If a model contains some underidentified equations, we need do
nothing about them unless we wish to estimate them. The remaining
equations, if identified, can be estimated in any case.
If we wish to estimate the underidentified equation, we must make
certain alterations:
1. Make it identified by bringing in parameter estimates from
independent sources, say, crosssection data. There are pitfalls of a
new kind in this method, however, which are noted briefly in Chap. 12.
2. Identify the equation in question by strategically adding variables
elsewhere in the model. This process, however, might deidentify the
rest of the model.
3. Go ahead and estimate the underidentified equation; then, if you
have a priori information on covariances, perform the tests of Sec. 6.9
¹ Compare with Appendix B.
to detect (or try to detect) whether you have estimated a bogus
function.
If, on the other hand, the model contains some overidentified
equations:
1. Use the full information, maximum likelihood method. This will
yield consistent and efficient estimates of the identifiable parameters.
2. Use the limited information, maximum likelihood method.
3. Use instrumental variables, weighted.
4. Use instrumental variables, unweighted.
5. In the given equations, add variables where they are most relevant
in such a way as to remove the overidentification.
6. Enlarge the system by endogenizing a previously exogenous
variable.
7. In the original overidentified model, remove the overidentification
by introducing redundant variables in the other equations. If it
turns out that the redundant variable has a significant parameter, you
have succeeded.
8. Drop variables to remove the overidentification. Instead of
outright dropping, you may linearly combine two or more such
variables. This cannot always be done, because the combined
variables are not always present together or absent together elsewhere
in the model.
9. Use the reduced form, and select arbitrarily one of the several
sets of alternative estimates.
Underidentification is a more serious handicap than overidentification.
To remove the former you have to make material alterations in
the model. To remove the latter you can always use the full information
method.
Whatever the final alterations, I would begin by constructing my
models without worrying about identification. In doing so, I am sure
that I am acting in the light of my best a priori wisdom, given the
objectives of my study and my computing budget. If it turns out
that identification makes alterations necessary, I think that honesty
requires me to keep a record of the identifying alterations. Like
Ariadne's thread, this record keeps track of my search for a second
best; I may want to give up in frustration and return to try another
way out of the Minotaur's chamber.
9.4. The "natural state" of an econometric model
Econometricians have devoted a good deal of attention to overidentified
models. This entire book, from Chap. 6 on, is devoted to
developing various approximations¹ to the full information method,
which everybody tries to avoid because of its burdensome arithmetic.
According to Liu,² we have been wasting our effort, because all
well-conceived econometric models are in truth necessarily underidentified:
In economic reality, there are so many variables which have an important
influence on the dependent variable in any structural equation that all
structural relationships are likely to be "underidentified."
So Liu would not use any of our elaborate techniques, but would
estimate just the reduced form and do so by simple least squares.
The reduced form is to include as many exogenous variables as our
knowledge and computational patience permit. Liu would then use
these estimates for forecasting, and claims that they forecast better
than all other techniques.
These subversive ideas deserve careful consideration. Is it true
that structural equations in their natural, unemasculated, noble
savage state are underidentified? If they are, in what sense are
forecasts from the reduced form better?
To begin with, there are occasions in which the investigator does
not care to know the values of the structural parameters and is content
with some kind of reduced form. To illustrate one occasion of this
sort, assume that the investigator
1. Works from a typical and large enough sample
2. Forecasts for an economy of fixed structure
3. Forecasts from exogenous variables that stay in their sample
ranges
Under the above conditions, an investigator would be glad to work
with a ready-made reduced form though not necessarily with parameters
estimated by simple least squares. He would accept the latter
if justifiable, not for want of anything better.
1 Unweighted and weighted instrumental variables and limited information.
² Ta-Chung Liu, "A Simple Forecasting Model for the U.S. Economy," p. 437
(International Monetary Fund Staff Papers, pp. 434-466, August, 1955).
Are econometric models necessarily underidentified? Admittedly,
it is an oversimplification, as Liu states,¹ to impose the condition that
certain variables be absent from a given structural equation. But it is
gross "overcomplification" — to coin a much-needed word — to impose
no condition at all, inviting into the demand for left-handed, square-headed
half-inch bolts (and on equal a priori standing with the price of
steel) the average diameter of tallow candles and the failure or success
of the cod catch off the banks of Newfoundland. My instinct advises
me to go halfway concerning these new variables: neither leave them
out altogether nor admit them as equals. Consider the model
    q + αp + γr + δf = u
    βq +  p          = v      (94)

consisting of one underidentified and one overidentified equation.
Now, if r and f are admitted as equals in the second equation, with
parameters of their own, the whole system becomes underidentified.
But the very knowledge that first convinced us to leave them out of the
second equation now advises us to tack them on with a priori small
parameters, small relative to β, γ, etc. A reasonable restatement
might be the following:
    q + αp + γr + δf = u
    βq +  p + jβr + kδf = v      (95)

where j, k are small constants, say 1/1000, 1/100, or some other not
unreasonable value. And now (wonder of wonders!) both equations
have become identified. The trick does not always work. For
instance, it does not help in

    q + αp + γr = w1
    βq +  p     = w2      (96)

to fill the hole with kαr, nor kβr, nor kγr, because we still have three
parameters (α, β, γ) to estimate and the reduced form contains only
two coefficients π1 = m_qr/m_rr, π2 = m_pr/m_rr. However, if the supply
of exogenous variables is less niggardly than in (96) it is not hard to
find reasonable ways to complete a model so as to identify it in its
entirety, if we so desire.
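The counting behind these identification claims is the standard order condition, which the discussion uses implicitly: an equation can be identified only if the number of exogenous variables excluded from it (with free coefficients) is at least the number of its included endogenous variables less one. A minimal sketch (the function and its labels are mine):

```python
def order_condition(included_endog, excluded_exog):
    """Classify an equation by the order condition for identification:
    it needs at least (included_endog - 1) excluded exogenous variables."""
    need = included_endog - 1
    if excluded_exog < need:
        return "underidentified"
    if excluded_exog == need:
        return "exactly identified"
    return "overidentified"

# Model (94): the exogenous variables are r and f.
print(order_condition(2, 0))  # first equation (q, p endogenous; excludes nothing)
print(order_condition(2, 2))  # second equation (excludes both r and f)

# Model (96): r is the only exogenous variable.
print(order_condition(2, 0))  # first equation: underidentified
print(order_condition(2, 1))  # second equation: exactly identified
```

The order condition is only necessary, not sufficient, and it says nothing about identification achieved through constrained coefficients such as the jβr and kδf terms of (95); those must be argued separately, as the text does.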
The most difficult and dangerous step is the assigning of values to
¹ Ibid., p. 405.
j and k. The values must have the correct algebraic sign; otherwise,
structural parameters are wildly misestimated. If the correct magnitudes
for j and k are unknown, it is better to err on the small side than
on the large. Too small (positive or negative) a value of j is better
than a hole in the equation, but too large a value may be worse than a
hole.
9.5. What are good forecasts?
If we want to forecast from an underidentified model, we have no
choice but to use some kind of reduced form; from an overidentified
model, it is convenient, not compulsory, to work from a reduced form.
The entire question in both cases is: What sort of reduced form?
How ought we to compute its coefficients?
To pin down our ideas, we shall consider the model By + Γz = u,
where u has all the Simplifying Properties; in addition we shall make
the covariances σ_{u_g u_h} known fixed constants, possibly all equal, so as to
keep them out of the way of the likelihood function. This way we
concentrate attention on the structural parameters β, γ, and π and their
rival estimates. The reduced form is y = πz + v, where π = −B^{-1}Γ,
v = B^{-1}u. The reduced form contains the entire set of exogenous
variables whether the original form is exactly, over-, or underidentified.
Maximum likelihood minimizes

    Σ_g Σ_t û²_{gt}

by the βs and γs; limited information and instrumental variables
approximate this. The naive reduced form advocated by Liu
minimizes

    Σ_g Σ_t v̂²_{gt}

by the πs (whatever these may be). Naturally, the two procedures
are not equivalent, and, naturally, the second guarantees that residuals
will be forecast with minimum variance. 1 But what is so good about
forecasting residuals with minimum variance? The forecasts themselves
1 Provided the sample and structure conform to conditions 1 to 3 of Sec. 9.4.
in both cases are (in general) biased, but the forecasts by maximum
likelihood have the greater probability of being right.
In Fig. 18, p is the course of future events if no disturbances occur.
The curve labeled p̂ shows the (biased) probability distribution of the
full information, maximum likelihood estimate of p; it is in general
biased (Ep̂ ≠ p) but has its peak at p itself. Curve p̃ is another
maximum likelihood estimate (say, instrumental variables or limited
information); it too has a peak at p but a lower one, perhaps a different
bias Ep̃, and certainly a larger variance than p̂. The reduced-form
least squares estimate is distributed as in curve p̄; naturally it has a

Fig. 18. The properties of forecasts. p: the true value of the forecast variable
under zero disturbances. p̄: reduced-form least squares estimates. p̂: full
information maximum likelihood estimates. p̃: other maximum likelihood
estimates.

smaller spread than p̂ and p̃; it may be more or less biased than either;
but its peak is off p.
To put this into words: If, in the post-sample year, all disturbances
happen to be zero, maximum likelihood estimates forecast perfectly,
and least squares forecast imperfectly. If the disturbances are nonzero,
both forecast imperfectly; but, on the average and in the long
run, least squares forecasts are less dispersed around their (biased)
mean.
Which criterion is more reasonable is, I think, open to debate. I
favor maximum likelihood estimates for much the same reason that I
accept the maximum likelihood criterion in the first place: If we are to
predict the future course of events, why not predict that the most
probable thing (u = 0) will happen? What else can we sanely
assume — the second most probable? On the other hand, if my job
depends on the average success of my forecasts, I shall choose the
least biased technique and disregard the highest probability of particular
instances. If I want to make a showing of unswerving, unvacillating
steadfastness, I shall use the least squares technique on the
reduced form, even though it steadfastly throws my forecasts off the
mark in each particular instance and in the totality of instances.
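The claim that the naive reduced form forecasts residuals with minimum variance in the sample is just least squares optimality, and can be checked directly. The two-equation system below is my own illustration, written as By = Γz + u so that the reduced form is y = πz + v with π = B^{-1}Γ (the text's convention By + Γz = u differs only in sign):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Illustrative structure (numbers assumed):  B @ y = Gamma @ z + u.
B = np.array([[1.0, -0.8],
              [0.0,  1.0]])
Gamma = np.array([[0.5, 0.0],
                  [0.0, 0.4]])
Pi_true = np.linalg.solve(B, Gamma)          # true reduced form  y = Pi z + v
Z = rng.normal(size=(n, 2))
U = rng.normal(size=(n, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
Y = Z @ Pi_true.T + U @ np.linalg.inv(B).T   # v = B^{-1} u

# Unrestricted reduced form, fitted equation by equation with simple least squares.
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
Pi_ols = coef.T

def sse(Pi):
    return float(np.sum((Y - Z @ Pi.T) ** 2))

# Least squares cannot be beaten in-sample -- not even by the TRUE reduced form:
print(sse(Pi_ols) <= sse(Pi_true))           # True
```

This is exactly the sense in which the Liu-style reduced form wins: its residual variance is smallest by construction, which is a statement about fit, not about the probability of forecasting the disturbance-free course of events.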
Further readings
The reference for Sec. 9.2 is H. Theil, "Estimation of Parameters of Econometric
Models" (Bulletin de l'Institut international de statistique, vol. 34, pt. 2,
pp. 122-129, 1954). It is full of misprints.
Extraneous estimators are illustrated in Klein, chap. 5, where he pools
time-series and cross-section data. Their statistical and common-sense
difficulties are discussed in Edwin Kuh and John R. Meyer, "How Extraneous
Are Extraneous Estimates?" (Review of Economics and Statistics, vol. 39,
no. 4, pp. 380-393, November, 1957).
Tinbergen, pp. 200204, discusses the advantages and disadvantages of
working from a reduced form, but overlooks that its least squares estimation
is maximum likelihood only for an underidentified or exactly identified
system.
Ever since Haavelmo, Koopmans, and others proposed elaborate methods
for correct simultaneous estimation, naive and not-so-naive least squares has
not lacked ardent defenders. Carl F. Christ, "Aggregate Econometric
Models" [American Economic Review, vol. 46, no. 3, pp. 385-408 (especially
pp. 397-401), June, 1956], claims that least squares forecasts are likely to be
more clustered than other forecasts; and Karl A. Fox, "Econometric Models
of the U.S. Economy" (Journal of Political Economy, vol. 64, no. 2, pp. 128-
142, April, 1956), has performed simple least squares regressions using the
data and form of the Klein-Goldberger model (for reference, see Further
Readings, chap. 1). See also Carl F. Christ, "A Test of an Econometric
Model of the United States 1921-1947" (Universities-National Bureau Committee,
Conference on Business Cycles, New York, pp. 35-107, 1951), with
comments by Milton Friedman, Lawrence R. Klein, Geoffrey H. Moore, and
Jan Tinbergen and a reply by Christ, pp. 107-129. In pp. 45-50 Christ
summarizes the properties of rival estimating procedures. E. G. Bennion,
in "The Cowles Commission's 'Simultaneous Equations Approach': A Simplified
Explanation" (Review of Economics and Statistics, vol. 34, no. 1, pp.
49-56, 1952), illustrates why least squares gives a better historical relationship
and better forecasts (as long as exogenous variables stay in their
historical range) than do simultaneous estimates. John R. Meyer and Henry
Laurence Miller, Jr., "Some Comments on the 'Simultaneous-equation
Approach'" (Review of Economics and Statistics, vol. 36, no. 1, February,
1954), state very clearly the different kinds of situations in which forecasts
have to be made — and to each corresponds a proper estimating procedure.
Herman Wold says that he wrote Demand Analysis (New York: John Wiley
& Sons, Inc., 1953) in large part to reinstate "a good many methods which
have sometimes been declared obsolete, like the least squares regression or
the shortcut of consumer units in the analysis of family budget data" and
to "reveal and take advantage of the wealth of experience and common sense
that is embodied in the familiar procedures of the traditional methods" (from
page x of the preface). He believes that the economy is in truth recursive
and that it can be described by recursive models whose equations, in the
proper sequence, can be estimated by least squares. His second chapter,
entitled "Least Squares under Debate" (especially secs. 7 to 9), is very far
from convincing me that he is right.
CHAPTER 10
Searching for hypotheses
and testing them
10.1. Introduction
Crudely stated, the subject of this chapter is how to tell whether
some variables of a given set vary together or not and which ones do so
more than others. The problem is how to make three interrelated
choices: (1) a choice among the variables available, (2) a choice among
the different ways they can vary together, and (3) a choice among
different criteria for measuring the togetherness of their variation.
The whole thing is like a complicated referendum for simultaneously
(1) choosing the number and identity of the delegates, (2) deciding
whether they should sit in a unicameral or multicameral legislature,
and (3) supplying them with rules of procedure to use when they go
into session.
This triple task is too much for a statistician, as it is for a citizenry:
it wastes statistical data, as it wastes voters' time and attention.
Just as, in practice, people settle independently, arbitrarily, and at a
prior stage the number of chambers, the number of delegates, and the
rules of procedure, so the statistician uses maintained hypotheses.
For example, in the model C_t = α + γZ_t + u_t of Chap. 1, the presence
of one and not two equations, two and not four variables, all the
remaining stochastic and structural assumptions, and the requirement
for maximizing likelihood are the maintained hypotheses. Only rival
hypotheses about the true parameter values α and γ remain to be
tested. The entire field of hypothesis searching and testing consists of
variations on the above theme. The maintained hypotheses can be
made more or less liberal, or they may change roles with the questioned
hypotheses. Section 10.4 lists many specific examples.
The general moral of this chapter is this: Having used your data to
accept or reject a hypothesis while maintaining others, you are not
supposed to turn around, maintain your first decision, and test another
hypothesis with the same data. If you are interested in testing two
hypotheses from the same set of data, you must test them together.
Thus, if you want to find both the form and personnel of government
preferred by the French, you should ask them to rank on the ballot all
combinations (like Gaillard/unicameral, Gaillard/bicameral, Pinay/
unicameral, Pinay/bicameral) and to decide simultaneously who is to
lead and which type of parliament; not the man first and the type
second; not the type first and the man second.
Everything that follows in this chapter pretends that variables are
measured without error. Sections 10.2 and 10.3 introduce two new
concepts: discontinuous hypotheses and the null hypothesis. Sections
10.4 to 10.8 explore some of the commonest hypotheses considered by
econometricians, especially when they set about to specify a model.
10.2. Discontinuous hypotheses
Consider again the simple model C_t = α + γZ_t + u_t. The rival
hypotheses here are alternative values of α and γ and may be any pair
of real numbers. This is an example of continuity.
Now consider this problem: Does x depend on y, or the other way
around? Taking the dependence (for simplicity only) to be linear and
homogeneous, the rival hypotheses here are

    x_t = γ y_t + u_t      versus      y_t = δ x_t + v_t

The answer is yes or no; either the first or the second equation holds.
This is an example of discontinuity. However, the further problem of
the size of γ (or δ), by itself, may be a continuous hypothesis problem.
Many of my examples below (Sec. 10.4) are discontinuous. The
simple maximizing rules of the calculus do not work when there is
discontinuity, and this fact makes it very interesting.
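That the two directions really are different hypotheses, not one fit read two ways, can be seen from a well-known identity for the homogeneous least squares slopes: their product equals the squared correlation, which falls short of one whenever the fit is imperfect. A small sketch with numbers of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.6 * x + rng.normal(scale=0.5, size=300)

gamma_hat = (x @ y) / (y @ y)   # least squares fit of  x_t = gamma*y_t + u_t
delta_hat = (y @ x) / (x @ x)   # least squares fit of  y_t = delta*x_t + v_t
r_squared = (x @ y) ** 2 / ((x @ x) * (y @ y))

print(gamma_hat * delta_hat)    # equals r_squared, which is below 1 here
```

If the two hypotheses were interchangeable, gamma_hat would simply be the reciprocal of delta_hat; the gap between their product and one measures how sharply the discontinuous choice matters.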
10.3. The null hypothesis
In selecting among hypotheses we can proceed in two ways: (1)
compare them all to one another; (2) compare them each to a special,
simple one, called the null hypothesis (symbolized by H₀). An example
of the first procedure is the maximum likelihood estimation of (α, γ)
in the model C_t = α + γZ_t + u_t, since it compares all conceivable
pairs (α, γ) in choosing the most likely among them. The other way to
proceed is somewhat as follows: select a null hypothesis, for example,
α = 3 and γ = 0.7, and accept or reject it (i.e., accept the proposition
"either α ≠ 3 or γ ≠ 0.7, or both") from evidence in the sample. I
have more to say later on about how to select a null hypothesis and
what criteria to use for accepting or rejecting it. Meanwhile, note that
the decision to proceed via null hypothesis has nothing to do with
continuity and discontinuity, though it happens that many applications
of the null hypothesis technique are in discontinuous problems.
10.4. Examples of rival hypotheses
Many of the examples in this section are linear and homogeneous for
the sake of simplicity only; in these cases linearity (and homogeneity) is
guaranteed not to affect the principle discussed. In other examples,
however, linearity (or homogeneity) is a rival hypothesis and thus
very much involved in the principle discussed. Now to the examples:
1. Which one variable from a given set of explanatory variables is
best? For instance, should we put income, past income, past consumption,
or age in a rudimentary consumption function? The rival
hypotheses here are

    C_t = βY_t + u_t      C_t = γY_{t−1} + u_t      C_t = δC_{t−1} + u_t      etc.
2. Should the single term be linear or quadratic, logarithmic, etc.?
The rival hypotheses here are
    C_t = βY_t + u_t      C_t = γY_t² + u_t      C_t = δ log Y_t + u_t      etc.

Note that this becomes a special case of example 1 if we agree that
Y², log Y, etc., are different variables from Y (Sec. 10.9).
3. What value of the single parameter is best? In C_t = βY_t + u_t
the rival hypotheses are different values of β, say, β = 1, β = 2,
β = 4/3, and others. This, too, is a special case of example 1, because
it can be expressed as a choice among the explanatory variables Y, 2Y,
4Y/3, respectively.
4. Should there be one or more equations in the model? This question,
important when several variables are involved, lurks behind the
problems of confluence (see Sec. 6.14), but it arises even with two
variables.
The above examples generalize, naturally. For instance, the
question may be which two or which three variables to include, which
linearly, which nonlinearly, how many lags, and how far back.
5. Which variables are to be regressed on which? The rival
hypotheses are

    x1 = αx2 + u      versus      x2 = βx1 + v

for two variables. If we maintain the hypothesis of three variables in
a single equation, the rival hypotheses are

    x1 = αx2 + βx3 + u      versus      x2 = γx1 + δx3 + v
                            versus      x3 = εx1 + ζx2 + w

And, if we maintain three variables and two equations, the rival
hypotheses become

    x1 = αx2 + βx3 + u               x1 = εx2 + ζx3 + w
    x2 = γx1 + δx3 + v    versus     x3 = ηx1 + θx2 + t

                          versus     x2 = κx1 + λx3 + s
                                     x3 = μx1 + νx2 + r

and so on for more equations and more variables. This is typically a
discontinuous problem. It is discussed briefly in Sec. 10.8.
6. Having decided that x1 is an explanatory variable, does it help
to include x2 as well? The rival hypotheses are

    y = αx1 + v      versus      y = βx1 + γx2 + w

Section 10.8 contains hints on this problem.
7. Having decided to include x1, which one other variable should be
added?

    y = αx1 + βx2 + u      versus      y = γx1 + δx3 + v      etc.

Section 10.8 applies to this problem.
8. Is it better to have a ratio model or an additive one?

c/n = α(y/n) + u    versus    c = βy + γn + v

This is discussed in Sec. 10.10.
9. Is it better to have a separate equation for each economic sector
or the same equation to which is added a variable characterizing the
sector? For example, consider the following rival demand models:

q = αp + u    for the poor
q = βp + v    for the rich        versus    q = γp + δy + w

where y is income. Section 10.11 discusses this problem.
10. (A special case of the above.) Are dummy variables better than
separate equations?

q = αp + u    in wartime                    q = γp + δQ + w
q = βp + v    in peacetime    versus        Q = 0 in peacetime
                                            Q = 1 in wartime

This problem is a special case of the example discussed in Sec. 10.11.
11. Do variables interact? That is to say, does the size of one or
more variables fortify (or nullify) the others' separate effects? For
instance, if being stupid and being old (the variables s and a, respectively)
are bad for earning income, are stupidity and old age in combination
worse than the sum of their separate effects? The rival
hypotheses are

y = αs + βa + u    versus    y = γs + δa + εsa + v

and can also be expressed as follows:

y = γs + δa + εsa + v        Null hypothesis: ε = 0

or as follows:

y = αs + βa + u    for the young        Null hypothesis: α = γ
y = γs + δa + v    for the old                           β = δ
This case is not spelled out, but the discussion of Sec. 10.6 applies to it.
This list is not exhaustive. And, naturally, the above questions can
be combined into complex hypotheses.
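The choice among rival functional forms can be made concrete with a small numerical sketch (the data, coefficients, and seed below are all invented for illustration): generate consumption that is in fact linear in income, then fit the three rival one-variable forms C = βY, C = γY², C = δ log Y by least squares and compare goodness of fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: consumption is in fact linear in income.
Y = rng.uniform(1.0, 10.0, 200)
C = 0.8 * Y + rng.normal(0.0, 0.3, 200)

def fit_r2(X, y):
    """Least-squares fit of y on the columns of X; return R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

# Rival hypotheses: C = beta*Y  versus  C = gamma*Y^2  versus  C = delta*log(Y).
r2_linear = fit_r2(Y[:, None], C)
r2_square = fit_r2((Y ** 2)[:, None], C)
r2_log = fit_r2(np.log(Y)[:, None], C)

print(r2_linear > r2_square and r2_linear > r2_log)
```

Because each rival uses a single transformed regressor, the comparison is the one-variable special case of example 1: a choice among explanatory variables.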
Digression on correlation and kindred concepts
This is a good place to gather together some definitions and
theorems and to issue some simple but often unheeded warnings.
It is also an excellent opportunity to learn, by doing the exercises,
to manipulate correlations and regression coefficients as well as
all sorts of moments.
Universe and sample. Keep in mind that Greek letters refer to
properties of the universe and that Latin letters are used to refer
to the corresponding sample properties.
Thus, as already explained in the Digression of Sec. 1.12, σ_xx,
σ_xy, σ_yy are population variances and covariances of x and y. The
corresponding sample quantities¹ are m_xx, m_xy, m_yy, the so-called
"moments from the sample means," introduced in the same
Digression,

m_xy = Σ_s (x_s − x̄)(y_s − ȳ)

where s runs over the sample S°. The universe coefficient of correlation
ρ is defined by

ρ = σ_xy/(σ_xx σ_yy)^½

and the corresponding sample coefficient r by

r = m_xy/(m_xx m_yy)^½

¹ To the population covariances σ_xy there correspond two types of sample
quantities: those measured from the mean of the universe,

q_xy = Σ_s (x_s − Ex)(y_s − Ey)

where s runs over the sample S°; and those measured from the mean of the sample,
namely m_xy. Interchanging q_xy and m_xy does not hurt at all, in general, when the
underlying model is linear, since m_xy is an unbiased, consistent, etc., estimator of
both q_xy and σ_xy, etc. There are difficulties in the case of nonlinear models, but
we shall not go into them here.
Later on we define partial, multiple, etc., coefficients of correlation.
In all cases, a coefficient of correlation measures the
togetherness of two and only two variables, though one or both may
be compounded of several others. This elementary fact is often
forgotten.
For the sake of symmetry in notation, when handling several
variables, we shall use x with subscripts: x1, x2, x3, etc. Then we
write simply ρ12, r12, m11 for ρ(x1)(x2), r(x1)(x2), m(x1)(x1), etc.
Both ρ_xy and r_xy range from −1 to +1. Values very near ±1
mean that x and y have a tight linear fit like αx + βy = u, with
the residuals very small. A tight nonlinear fit like x² + y² = 1
does not yield a large coefficient of correlation ρ_xy. What we need
to describe this fit is ρ(x²)(y²). And similarly for relations like
αy + β log x = u or αy² + βx³ = u, we need ρ(log x)(y), ρ(x³)(y²),
respectively.
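The circle example can be checked numerically in a few lines; a minimal sketch (the points on the unit circle are invented for illustration and do not come from the text):

```python
import numpy as np

# Points on the circle x^2 + y^2 = 1: a perfectly tight *nonlinear* fit.
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
x, y = np.cos(t), np.sin(t)

r_xy = np.corrcoef(x, y)[0, 1]          # near zero: no linear togetherness
r_x2y2 = np.corrcoef(x**2, y**2)[0, 1]  # exactly -1: x^2 = 1 - y^2 is linear

print(abs(r_xy) < 1e-10, abs(r_x2y2 + 1.0) < 1e-10)
```

The ordinary coefficient sees nothing, while the coefficient between the squared variables is perfect, because x² and y² stand in an exact linear relation.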
10.5. Linear confluence
From now on until the contrary is stated, I shall deal with linear
relations exclusively. The discussion is perfectly general for any
finite number of variables, but three are enough to capture the essence
of the problems with which we shall be dealing. Let the three variables
be

X1    number of pints of liquor sold at a ski resort in a day
X2    number of tourists present in the resort area
X3    average daily temperature
We suppose there are one or several linear stochastic relationships
among some or all of these variables. The least-squares-regression
coefficients are denoted by a's and b's, with a standard system of
subscripts.
Begin with regressions among X1, X2, X3, taken two at a time;
there are six such regressions. In (10-1) below these regressions are
arranged in rows according to the variable that is treated as if it were
dependent and in columns according to the variable treated as
independent.

X1 = a1.2 + b12X2        X1 = a1.3 + b13X3
X2 = a2.1 + b21X1        X2 = a2.3 + b23X3        (10-1)
X3 = a3.1 + b31X1        X3 = a3.2 + b32X2
In each subscript, the very first digit denotes the dependent variable.
If there is a second digit before any dot appears, it denotes the inde
pendent variable to which the coefficient belongs. Digits after the
dot (if any) represent the other independent variables (if any) present
elsewhere in the equation. The order of digits before the dot is
material, because it tells which variable is regressed on which. The
order of subscripts after the dot is immaterial, because these digits
merely record the other "independent" variables.
The same three variables can be regressed three at a time. There
are three such regressions:

X1 = a1.23 + b12.3X2 + b13.2X3
X2 = a2.13 + b21.3X1 + b23.1X3        (10-2)
X3 = a3.12 + b31.2X1 + b32.1X2

As an exercise, consider the four-variable regression

X1 = a1.234 + b12.34X2 + b13.24X3 + b14.23X4

and fill in the missing subscripts in

X3 = a__.__ + b__.__X1 + b__.__X2 + b__.__X4
Returning to our liquor example, suppose we decide to measure the
three variables not from zero but from each one's sample mean. If
primed small letters represent the transformed variables, we know
that the a's drop out and the b's remain unchanged. This is so
because the model is linear. Our relations (10-1) and (10-2) now
become

x1′ = b12x2′    · · ·    x3′ = b31.2x1′ + b32.1x2′
Exercises
10.A Prove r(X1)(X2) = r(x1′)(x2′) = r12, that is to say, that correlation
does not depend on the origin of measurement.
10.B Prove

r12² = b12 b21

Hint: Use moments.
This relation says that the coefficient of correlation between two
variables equals the geometric mean of the two regression slopes we
get if we treat each in turn as the independent variable. The less
these two regressions differ, the nearer is the correlation to +1 or — 1.
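Exercise 10.B is easy to verify numerically; a sketch with invented data (the identity r12² = b12·b21 holds exactly, whatever the numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=500)

# Moments from the sample means.
d1, d2 = x1 - x1.mean(), x2 - x2.mean()
m11, m22, m12 = (d1 @ d1), (d2 @ d2), (d1 @ d2)

b12 = m12 / m22                    # slope of x1 regressed on x2
b21 = m12 / m11                    # slope of x2 regressed on x1
r12 = m12 / np.sqrt(m11 * m22)

print(np.isclose(r12**2, b12 * b21))
```

Writing each slope in terms of moments makes the proof one line: b12·b21 = (m12/m22)(m12/m11) = r12².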
10.6. Partial correlation
Two factors may account for x1′, the sale of a lot of liquor: (1) there
are many people (x2′); (2) it is very cold (x3′). This relation is expressed

x1′ = b12.3x2′ + b13.2x3′        (10-3)

But the reason that (1) there are many people in the resort is (a) that
the weather is cold, and (possibly) (b) that a lot of drinking is going on
there, making it fun to be there apart from the pleasure of skiing.
This is expressed

x2′ = b21.3x1′ + b23.1x3′        (10-4)
Suppose we wanted to know whether liquor sales would be correlated
with crowds in the absence of weather variations. The measure we
seek is the partial correlation between x1′ and x2′, allowing for x3′. This
measure is symbolized by r12.3. It is interpreted as follows:
Define the variables

y1′ = x1′ − b13.2x3′        (10-5)
y2′ = x2′ − b23.1x3′        (10-6)

The y's are sales corrected for weather only and tourists corrected for
weather only. If we have corrected both for weather, any remaining
covariation between them is due to (1) the normal desire for people to
drink liquor (the more tourists the more liquor is sold), (2) the possibility
that some tourists come to enjoy drinking rather than skiing (the
more liquor, the more tourists), and (3) a combination of the first two
items.
The partial coefficient of correlation is defined by

r12.3 = r(y1′)(y2′) = m(y1′)(y2′)/[m(y1′)(y1′) m(y2′)(y2′)]^½        (10-7)

Exercises
10.C Prove r21.3 = r12.3.
10.D Prove r12.3² = b12.3 b21.3. This is analogous to Exercise 10.B.
Hint: Substitute (10-5) and (10-6) into (10-3) and (10-4).
10.E Prove

r12.3 = (r12 − r13 r23)/[(1 − r13²)^½ (1 − r23²)^½]

from definition (10-7) and Exercises 10.C and 10.D.
10.F Give a common-sense interpretation of the propositions in
the above three exercises.
10.G All this generalizes to four and more variables, but notation
gets very messy. Exercise 10.D generalizes into the proposition:
Every (partial or otherwise) coefficient of correlation equals the
geometric mean of the two relevant regression coefficients. So, for
example, r12.34² = b12.34 b21.34.
Let r stand for the matrix of all simple coefficients of correlation
rij, and let Rij stand for the minor of rij. Then Exercise 10.E is
rewritten

r12.3² = R12²/(R11 R22)

and with four variables

r12.34² = R12²/(R11 R22)

and so on for any number of variables, the dimension of R growing
all the while, of course.
10.H Show that r12.3² = R12²/(R11 R22) holds but collapses into an
identity when there is no third variable.
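The residual-correlation interpretation of r12.3 and the closed form of Exercise 10.E can be checked against each other numerically. A sketch with invented data, in which both tourists (x2) and liquor sales (x1) are driven partly by weather (x3):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x3 = rng.normal(size=n)                               # weather
x2 = 0.8 * x3 + 0.5 * rng.normal(size=n)              # tourists
x1 = 0.5 * x2 + 0.3 * x3 + 0.5 * rng.normal(size=n)   # liquor sales

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def slope(a, b):
    """Least-squares slope of a on b, both measured from their means."""
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (b @ b)

# Route 1: correlate the weather-corrected variables of (10-5) and (10-6).
y1 = (x1 - x1.mean()) - slope(x1, x3) * (x3 - x3.mean())
y2 = (x2 - x2.mean()) - slope(x2, x3) * (x3 - x3.mean())
r12_3_residual = corr(y1, y2)

# Route 2: the closed form of Exercise 10.E.
r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
r12_3_formula = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

print(np.isclose(r12_3_residual, r12_3_formula))
```

The two routes agree exactly in any sample, which is the content of Exercise 10.E.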
10.7. Standardized variables
Let us now measure X1, X2, X3 not only as departures x1′, x2′, x3′ from
their sample means but also in units equal to the sample standard
deviation of each. So transformed, the variables are called just
x1, x2, x3.
This step is useful in bunch map analysis (see Sec. 10.8). When
this is done, nothing happens to either the population or the sample
correlation coefficients, but the regression parameters between the
variables do change.
Exercises
10.I Prove m(x1)(x2) = m(x1′)(x2′)/[m(x1′)(x1′) m(x2′)(x2′)]^½.
10.J Prove that r(x1)(x2) = r(X1)(X2) = r12 by using Exercises 10.B
and 10.C.
10.K Denote the regression coefficients among x1′, x2′, x3′ by the
letter b and the corresponding coefficients among the standardized
variables x1, x2, x3 by the letter a, with appropriate subscripts. Interpret
a12.3, a21.3, a32.1; show that they differ in meaning from a1.23, a2.13,
a3.12, respectively.
10.L Show that a12 = b12(m22/m11)^½ and, in general, that
a_ij.k = b_ij.k(m_jj/m_ii)^½.
10.M Show, by using Exercise 10.L, that r_ij.k² = a_ij.k a_ji.k.
10.N Show that r12 = a12. This is a very important property,
which says that regression and correlation coefficients are identical
for standardized variables.
10.O Let x_i″ = (X_i − EX_i)(σ_ii)^−½. Prove ρ(x_i″)(x_j″) = ρ(X_i)(X_j), and
interpret.
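Exercise 10.N can be verified in a few lines; a sketch with invented data (standardization here uses the sample standard deviation, as in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = 10.0 + 2.0 * rng.normal(size=300)
X2 = 5.0 + 0.7 * X1 + rng.normal(size=300)

# Standardize: departures from the sample mean, in units of the sample s.d.
x1 = (X1 - X1.mean()) / X1.std()
x2 = (X2 - X2.mean()) / X2.std()

a12 = (x1 @ x2) / (x2 @ x2)        # regression slope of x1 on x2
r12 = np.corrcoef(X1, X2)[0, 1]    # correlation of the raw variables

print(np.isclose(a12, r12))        # Exercise 10.N: slope = correlation
```

After standardizing, both variables have unit variance, so the slope m12/m22 collapses to the correlation coefficient itself.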
10.8. Bunch map analysis
Bunch maps are mainly of archaeological or antiquarian interest.
They seem to have gone out of fashion. Beach (pp. 172-175) gives
an excellent account of them which I shall not repeat here. I shall
merely discuss necessary and sufficient conditions under which bunch
maps help to accept or reject hypotheses.
Turn to the example of liquor sales, skiers, and cold weather in
Sec. 10.5. Let x1, x2, x3 be the three standardized variables. Let their
correlation coefficients be
r = | 1     r12   r13 |     | 1     0.5   0.2 |
    | r21   1     r23 |  =  | 0.5   1     0.8 |
    | r31   r32   1   |     | 0.2   0.8   1   |
Compute the least squares regressions of all normalized variables, two
at a time:

x1 = a12x2        x1 = a13x3
x2 = a21x1        x2 = a23x3        (10-8)
x3 = a31x1        x3 = a32x2

and then three at a time:

x1 = a12.3x2 + a13.2x3
x2 = a21.3x1 + a23.1x3        (10-9)
x3 = a31.2x1 + a32.1x2
Construct now the unit squares, shown in Fig. 19, where 0 marks the
origin. In each block the horizontal axis corresponds to the independent
variable, and the vertical to the dependent. The labels
below the squares show which is which.
Refer now to the first equation x1 = a12x2 in (10-8). From Exercise
10.N, x1 = r12x2. Imagine a unit variation in the independent variable
x2; then the corresponding variation in x1, according to this equation,
is a12. Plot the point (1, a12) in the first block of squares. Then go
to the symmetrical equation x2 = a21x1, make x1 vary by Δx1 = 1, and
plot the resulting point (a21, 1) in the same block. In a similar way
fill out the top row of Fig. 19, drawing the beams from the origin.
In (10-9), first consider the variation in x1 resulting from variations
in x2, other things being equal. We get three different answers from
(10-9), one per equation:

Δx1 = a12.3 Δx2
Δx1 = (1/a21.3) Δx2
Δx1 = −(a32.1/a31.2) Δx2

Digressing a little, I state without proof that

a_ij.k = R_ji/R_ii        (10-10)
[Fig. 19. Bunch maps. Legend: first subscript place, dependent variable;
second place, independent variable; after the dot, variable allowed for.]
Therefore we get from (10-10) the three statements that Δx1:Δx2 is
proportional to R12:R11, to R22:R21, and to −R32:R31. In the figure
this is depicted, respectively, by the beams marked (12.3). The three
regressions in general conflict both with regard to the slope and with
regard to the length of the beams.
Derive the corresponding relations for Δx1:Δx3 and Δx2:Δx3. These
results are plotted in the last two panels of Fig. 19.
Exercise
10.P Plot the bunch maps for

ρ = | 1     0.6   0.1 |
    | 0.6   1     0.6 |
    | 0.1   0.6   1   |
Scanning Fig. 19 is supposed to tell us (1) which two variables to
regress if we want to stick to two of the given three, and (2) whether a
third variable is superfluous, useful, or detrimental in some very loose
sense.
What do we look for in Fig. 19? Three things: (1) opening or
closing of bunch maps as you go from the upper to the lower panel,
(2) shortening of the beams, and (3) change of direction of the bunches.
There is no simple intuitive way to interpret the many combinations
of 1, 2, and 3; this is the main reason why statisticians have abandoned
bunch maps.
The examples that follow far from exhaust the possibilities. The
moral of these examples is: To interpret the behavior of the bunch
maps, you must translate them into correlation coefficients r and
try to interpret what it means for the coefficients to be related in one
way or another. But one might as well start with the correlation
coefficients, bypassing the bunch maps altogether.
Example 1. The vanishing beam
What can we infer if beam R12/R11 shrinks in length? Take the
extreme case R12 ≈ 0 and R11 ≈ 0. These imply r12 = r13r23 and
r23² = 1, which, in turn, imply r23 = ±1 and r12 = ±r13. Let us
restrict the illustration to the plus-sign case r23 = 1, r12 = r13.
The meaning of r23 = 1 is that x2 and x3 in the sample, uncorrected
for variations in x1, are indistinguishable variables. Relation r12 = r13
shows that, if x2 and x3 were corrected for x1, the corrections would be
identical; the resulting corrected variables are also identical. This
can also be seen from the fact that in these circumstances r23.1 equals 1.
All this would, of course, be detectable from the top level of Fig. 19,
signifying that three variables are too many and that any two are
nearly as good as any other two.
Example 2. The tilting beam
What does it mean if beam R12/R11 tilts toward one axis without
shrinking in length? For instance, let R12 ≠ 0 and R11 = 0. This
implies again that r23 = ±1, that is to say, x2 = ±x3. Taking again
just the + case, this signifies that the uncorrected x2 and x3 are in
perfect agreement. However, R12 = r12 − r13r23 ≠ 0, or r12 ≠ r13; take
the case r12 < r13 for the sake of the illustration. The inequality
r12 ≠ r13 suggests that the corrections of x2 and x3 to take account of
variations in x1 will be different corrections and will upset the perfect
harmony. This can be seen again from

r23.1 = R23/(R22R33)^½ = (r23 − r12r13)/[(1 − r12²)^½(1 − r13²)^½]
     = (1 − r12r13)/[(1 − r12²)^½(1 − r13²)^½] ≠ 1
In terms of our example, there is a spurious perfect correlation between
x2, the number of skiers, and x3, the weather. It is spurious because
some skiers come to enjoy not the weather but the liquor. However,
liquor sales respond less perfectly to tourist numbers than to weather;
that is, r12 < r13. Therefore, if you take into account the fact that
liquor too attracts skiers, the weather is not so perfectly predictable a
magnet for skiers as you might have thought by looking at r23 = 1.
The hypothesis accepted in this case is: Liquor is significant and ought
to be introduced in a predictive model.
Exercises
10.Q Show that, if beam a12.3 has the same slope as a12, this implies
a12.3 = r12 and also r13 = r23 and, hence, that all three beams of the
bunch map come together. Interpret this.
10.R Interpret the situation where all three beams R12/R11, R22/R21,
and −R32/R31 have the same slope. Must they necessarily have the
same length? Must the common slope necessarily equal a12?
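The three conflicting answers for Δx1/Δx2 can be computed directly for the numerical correlation matrix given in the text; a sketch (the helper solves the normal equations of the three-at-a-time regressions (10-9) for standardized variables):

```python
import numpy as np

r = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.8],
              [0.2, 0.8, 1.0]])    # the correlation matrix of Sec. 10.8

def slopes_of(r, dep, indep):
    """Coefficients of standardized x_dep regressed on the x_indep (0-based)."""
    return np.linalg.solve(r[np.ix_(indep, indep)], r[dep, indep])

a12_3, a13_2 = slopes_of(r, 0, [1, 2])
a21_3, a23_1 = slopes_of(r, 1, [0, 2])
a31_2, a32_1 = slopes_of(r, 2, [0, 1])

# Three answers for dx1/dx2, x3 held fixed, one per equation of (10-9).
slopes = (a12_3, 1.0 / a21_3, -a32_1 / a31_2)
print([round(s, 3) for s in slopes])   # the three beams disagree
```

The three slopes differ substantially for this matrix, so the (12.3) bunch is wide open: the three regressions conflict, exactly the situation the bunch map is meant to display.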
10.9. Testing for linearity
If the rival hypotheses are

y = βx + u    versus    y = γx² + v

the matter is quickly settled by comparing the correlation coefficients
r(x)(y) with r(x²)(y). Things become complicated if the quadratic function
contains a linear term, because the function y = γx² + δx + v contains
the linear function y = βx + u as a special case; therefore, we
would expect the correlation to be improved by adding a higher term.
Thus, for any fit giving estimates of β, γ, and δ, the correlation of y
with the fitted γx² + δx is bound to be greater than the correlation of y
with the fitted βx. Correlation coefficients do not give the best tests of
linearity. Common sense suggests something simpler and more
intuitive.
[Fig. 20. Tests of nonlinearity.]

The curves in Fig. 20a represent the two rival hypotheses. If the
quadratic is true but we fit a straight line, then the computed residuals
from the fitted straight line will be overwhelmingly positive for
some ranges of x and overwhelmingly negative for other ranges. These
ranges are defined in terms of the intersections of the rival curves.
Somewhere left of A most residuals are negative, and to the right, most
are positive. Complicated numerical formulas for testing nonlinearity
are nothing but algebraic translations of this simple test.
All this generalizes quite readily. For instance, the test of hypothesis
y = αx + u versus a cubic is sketched in Fig. 20b; a quadratic versus
a cubic in Fig. 20c. And it generalizes into several variables x, y, z, etc.
In each case the test consists in dividing the range of x into several
equal parts P1, P2, . . . , as shown in either Fig. 21a or 21b. In each
part compute the average straight-line regression residual av u. If this
tends to vary systematically (with a trend or in waves), the relationship
is nonlinear.
When we have three or more variables x, y, z and want to test
linearity versus some other hypothesis, we have to extend to two
dimensions the technique of Fig. 21. Let the rival hypotheses be

x = α + βy + γz + u    versus    x = δ + εy + ζy² + ηz + θz² + κyz + v

In the yz plane the intersection of these two surfaces projects a
hard-to-solve-for and messy curve KLMNP (see Fig. 22a). Instead of
obtaining it, let us see whether we can sketch it vaguely. Divide the
sample range of y and z into chunks, as shown in the figure (they do not
[Fig. 21. The interval test.]
need to be square, and they may overlap in a systematic way analogous
to Fig. 21b). In each chunk, compute the average linear residual av u,
and see whether a pattern emerges. By drawing approximate contour
lines according to the local elevation of av u, we may be able to detect
mountains or valleys, which tell us that the true relationship is nonlinear.
Something analogous can be done when both rival hypotheses
are nonlinear.
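The one-dimensional interval test is easy to mechanize; a sketch with invented data in which the truth is quadratic but the rival straight line is fitted:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, 600)
y = 0.5 * x + 0.4 * x**2 + rng.normal(0.0, 0.3, 600)   # the truth is quadratic

# Fit the rival linear hypothesis y = a + b*x by least squares.
b, a = np.polyfit(x, y, 1)
u = y - (a + b * x)                                    # computed residuals

# Divide the range of x into six equal parts P1..P6 and average u in each.
edges = np.linspace(x.min(), x.max(), 7)
av_u = [u[(x >= lo) & (x < hi)].mean() for lo, hi in zip(edges[:-1], edges[1:])]

# A wave of signs (+, -, ..., -, +) signals that the relationship is nonlinear.
print(av_u[0] > 0 and av_u[-1] > 0 and min(av_u[1:-1]) < 0)
```

The averaged residuals are positive in the outer parts and negative in the middle, the systematic wave the text describes; under the linear hypothesis they would hover around zero in every part.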
10.10. Linear versus ratio models
The rival hypotheses here are

c/n = α + β(y/n) + u    versus    c = γ + δy + εn + v

where u and v have the usual properties to ensure that least squares
fits are valid.
If the ratio model is the maintained hypothesis, then we would
expect av u to be constant over successive segments of the axis y/n.
Translated into the projection on the yn plane, this means that av u
should be constant in the successive slices shown in Fig. 22b. For the
linear model, av v should be constant in the squares of Fig. 22c. In
general, one criterion will be satisfied better than the other and will
plead for the rejection of the opposite hypothesis. If both criteria are
[Fig. 22. (a) The intersection curve KLMNP in the yz plane; (b) slices;
(c) squares; (d) the grid for the poor and the rich.]
substantially satisfied, then there is no problem of choosing, because
both formulations say that c, y, and n are related linearly and homogeneously
(γ = 0). One formulation might possibly be more efficient
than the other for reasons of "skedasticity" (compare Sec. 2.15).
10.11. Split sectors versus sector variable
The rival hypotheses here are whether the demand for, say, sugar
should be estimated for all consumers as a linear function of price and
income, q = γp + δy + w (where the price paid is uncorrelated with
income), or should be split into several demand functions q = αp + u,
q = βp + v, etc., one for each income class, on the ground that price
means more to the poor than to the rich.
For illustration it is enough if we have just two income classes, the
rich and the poor, corresponding to, say, y = 10, y = 1. Nothing
essential would be added if y were taken as a continuous variable.
As in Fig. 22c, construct a grid for the sample range of variables y
and p. If av w is constant, the single equation q = γp + δy + w
is good enough, and, moreover, we have α = β in the alternative
hypothesis. If, however, the second hypothesis is correct, not only
will α be very different from β, but av w will display contours like
those of Fig. 22d.
10.12. How hypotheses are chosen
In this section I am neither critical, nor constructive, nor original.
I think it proper to look at the way that statistical hypothesis making
and testing takes place around us.
The econometrician, geneticist, or other investigator usually begins
with (1) prejudices instilled from previous study, (2) vague impressions,
(3) data, (4) some vague hypotheses.
He then casts a preliminary look at the data and informally rejects
some because they represent special cases (war years, for instance,
or extremely wealthy people) and others because they do not square
with the vague hypotheses he holds. He uses the remaining data
informally to throw out some of his hypotheses, from among those that
are relatively vague and not too firmly grounded in prejudice.
At this stage he may prefer to scan the data mechanically, say, by
bunch maps, rather than impressionistically. Mechanical prescreening
is used (1) because the variables are many, and the unaided eye is
bewildered by them, and (2) because the research worker is chickenhearted
and distrusts his judgment. Logically, of course, any
mechanical method is an implicit blend of theory and estimating
criteria; but, psychologically, it has the appearance of objectivity.
The good researcher knows this, but he too is overwhelmed by the
illusion that mechanisms are objective.
Having done all this, the investigator at long last comes to specification
(as described in Chap. 1); he then estimates, accepts, rejects, or
samples again.
This stage-by-stage procedure is logically wrong, but economically
efficient, psychologically appealing, and practically harmless in the
hands of a skilled researcher with a feel for his area of study.
Instead of proceeding stage by stage, is there a way to let the facts
speak for themselves in one grand test? The answer is no. We must
start with some hypothesis or we do not even have facts. True,
hypotheses may be more or less restrictive. But the less restrictive the
hypotheses are, the less a given body of data can tell us.
Further readings
For rigorous treatment of the theory of hypothesis testing, one needs to
know set theory and topology. Klein's discussion, pp. 56-62, gives a good
first glimpse of this approach and a good bibliography, p. 63.
For treatment of errors in the variables, consult Trygve Haavelmo, "Some
Remarks on Frisch's Confluence Analysis and Its Use in Econometrics,"
chap. V in Koopmans, pp. 258-265.
Beach discusses bunch maps and the question of superfluous, useful, or
detrimental variables, pp. 174-175. Tinbergen, pp. 80-83, shows a five-variable
example.
Cyril H. Goulden, Methods of Statistical Analysis, 2d ed., chap. 7 (New
York: John Wiley & Sons, Inc., 1952), gives an elementary discussion of
ρ and the sample properties of its estimate r.
CHAPTER 11
Unspecified factors
11.1. Reasons for unspecified factor analysis
Having specified his explanatory variables, the model builder frequently
knows (or suspects) that there are other variables at work that
are hard to incorporate.
1. The additional variable (or variables) may be unknown, like the
planet Neptune, which used to upset other orbits.
2. The additional variable may be known but hard to measure. For
instance, technological change affects the production function, but
how are we to introduce it explicitly?
There are two ways out of this difficulty: splitting the sample, and
dummy variables. When we split the sample we fit the production
function to each fragment independently in the hope that each frag
ment is uniform enough with regard to the state of technology and yet
large enough to contain sufficient degrees of freedom to estimate the
parameters. The technique of dummy variables does not split the
sample, but instead introduces a variable that takes on two and only
two values or levels: when, say, there is peace, and 1 when there is war.
Phenomena that are capable of taking on three or more distinct states
are not suited to the dummy variable technique. For instance, it
would not do to count 0 for peace, 0.67 for cold war, and 1 for shooting
war, because this would impose an artificial metric scale on the state
of world politics which would affect the parameters attached to honest-to-goodness,
truly measurable variables. No artificial metric scale is
introduced by the two-level dummy variable.
3. The additional factors at work may be a composite of many
factors, too many to include separately and yet not numerous enough
or independent enough of one another to relegate to the random term
of the equation.
4. The additional variable may be known and measurable, but
we may not know whether to include it linearly, quadratically, or
otherwise.
5. The additional variable may be known, measurable, etc., but
not simple to put in. To admit a wavy trend line, for instance, eats
up several degrees of freedom.
In such cases the unspecified variable technique comes to our rescue,
at a price, because it sometimes requires special knowledge. In the
illustration of Sec. 11.2, for instance, to estimate a production function
that shifts with technological change, time series are not enough. The
data must contain information about inputs and outputs broken down,
say, by region, or in some dimension besides chronology.
11.2. A single unspecified variable
This section is based on the technique developed by C. E. V. Leser¹
in his study of British coal mining during 1943-1953, years of rapid
technological change, nationalization, and other disturbances.
He fitted the function P_rt = g_t L_rt^α C_rt^β, where P is product, L is labor,
C is capital, g_t is the unspecified impact of technology, r and t are
regional and time indices, and α, β are the unknown parameters.
Here, for exposition's sake, I shall linearize his model and drop the
second specified variable. Consider then

P_rt = g_t + αL_rt + u_rt        (11-1)

¹ C. E. V. Leser, "Production Functions and British Coal Mining," Econometrica,
vol. 23, no. 4, pp. 442-446, October, 1955.
The following assumptions are made:
1. Technology affects all regions equally in any moment of time.
2. The same production function applies to all regions.
3. The random term is normal, with a period mean

(1/R) Σ_r u_rt

equal to zero, and a regional mean

(1/T) Σ_t u_rt

also equal to zero. We shall now use the notation av[r]u_rt and av[t]u_rt
for expressions like the last two.¹
Now, keeping time fixed at t = 1, let us average inputs and
outputs over the R regions. From (11-1) we get, remembering that
av[r]g_t = g_t,

av[r]P_r1 = g_1 + α av[r]L_r1        (11-2)

And, by subtracting (11-2) from (11-1), we get the following relation
between P′_r1 and L′_r1, which are product and labor measured from their
mean values of period 1:

P′_r1 = αL′_r1 + u_r1        (11-3)

Do the same for t = 2, . . . , T and then maximize the likelihood of
the sample. Under the usual assumptions, this is equivalent to
minimizing the sum of squares

Σ_rt (P′_rt − αL′_rt)²

The resulting estimate of α is

α̂ = m_P′L′/m_L′L′        (11-4)

In this expression the moments are sums running over all regions and
time periods.

¹ Read "average over the r regions," "average over the t years."
Having found α̂, we can go back to (11-2) to compute the time path
g_t of the unspecified variable, technology.
The method I have just outlined has several advantages:
1. It uses R × T observations (a large number of degrees of freedom)
in estimating the parameter α.
2. Unlike split sampling, it obtains a single parameter estimate for
all regions and periods.
3. It yields us an estimate of the unspecified variable (technological
change), if it is the only other factor at work.
4. This technological change does not have to be a simple function
of time. It may be secular, cyclical, or erratic; it can be linear,
quadratic, or anything else.
5. The method estimates, in addition to technology, the effects of
any number of other unspecified variables (such as inflation, war,
nationalization) which at any moment may affect all regions equally.
The chief disadvantage of the technique is that the unspecified
variable g_t has to be introduced in a manner congenial to the model,
that is to say, as a linear term in a linear model, as a factor in Leser's
logarithmic model, and so forth; otherwise it would not drop out,
as in (11-3), when we express the specified variables as departures
from their average values.
For the unspecified variable technique to be successful it is necessary
that the data come classified in one more dimension than there are
unspecified variables. Thus P and L must have two subscripts.
Moreover, each region must have coal mines in each time period. 1
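The estimator (11-4) and the recovered path g_t can be sketched in a few lines (the region count, period count, and parameter values below are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
R, T = 8, 12                              # regions and periods (invented sizes)
alpha = 0.6
g = np.cumsum(rng.normal(0.1, 0.3, T))    # erratic technology path g_t
L = rng.uniform(5.0, 15.0, (R, T))        # labor input, by region and period
P = g[None, :] + alpha * L + rng.normal(0.0, 0.1, (R, T))   # model (11-1)

# Measure P and L from each period's regional mean; g_t drops out as in (11-3).
Pd = P - P.mean(axis=0, keepdims=True)
Ld = L - L.mean(axis=0, keepdims=True)

# (11-4): alpha_hat = m_{P'L'} / m_{L'L'}, moments summed over all r and t.
alpha_hat = (Pd * Ld).sum() / (Ld * Ld).sum()

# Back to (11-2): recover g_t = av[r]P_rt - alpha_hat * av[r]L_rt.
g_hat = P.mean(axis=0) - alpha_hat * L.mean(axis=0)

print(abs(alpha_hat - alpha) < 0.05, np.allclose(g_hat, g, atol=0.2))
```

Note that the recovered g_t is erratic rather than a smooth function of time, illustrating advantage 4: nothing about the technique requires g_t to follow a trend.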
11.3. Several unspecified variables
Imagine now that we wish to explain retail price P in terms of unit
cost C, distance or location D, monopoly M, and the general level of
inflation J. Cost is the specified variable, and location, monopoly,
and inflation are left unspecified for one or another of the reasons I
¹ There are methods for treating lacunes, or missing data, but these are rather
elaborate and will not be discussed in this work. The usual way to treat a lacune
is by pretending it is full of data that interpolate perfectly in whatever structural
relationship is finally assigned to the original data.
recounted in Sec. 11.1. The model, assumed to be linear, is

P_firt = M_i + D_r + J_t + αC_firt + u_firt        (11-5)

where the subscripts f, i, r, t express firm, industry, region, and time.
The model, as written, maintains that the degree of monopoly is a
property of the industry only, not of the region or of the inflationary
situation or of interactions among the three. Similarly, inflation is
solely a function of the time and not of the degree of monopoly and
location of industry. Note again that the data have to come classified
in one more dimension than there are unspecified variables. Thus P
and C must have four subscripts, one for each of the unspecified
variables, plus an extra one (firm f). Moreover, unless we have
lacunes, each firm must be present in each industry, region, and time
period. The firms of Montgomery Ward and Sears Roebuck would
do,¹ and the industries they enter can be, say, watch retailing, tire
retailing, clothing retailing, etc.
In that case, α is estimated analogously to (11-4) by α̂ = m_{P'C'}/m_{C'C'},
where the moments are sums running over f, i, r, t. Having esti-
mated α, we can now define a new variable S, the price-cost spread
S = P - α̂C. The model is now

S_firt = M_i + D_r + J_t + v_firt     (11-6)

Estimating M, D, and J is the so-called problem of linear factor analysis.
11.4. Linear orthogonal factor analysis
Linear factor analysis attempts to explain the spread S as an additive
resultant of two or more separate factors; in the example of (11-6)
there are three factors: monopoly, region, and inflation.
Nothing essential is lost if we confine ourselves to two factors, say,
monopoly and inflation, and consider the simpler model
S_fit = M_i + J_t + v_fit     (11-7)
To grasp its essence, imagine that there are no random disturbances
(v = 0) and that there is only one firm, which sells three products
1 Provided both exist in all time periods, regions, and industries included in
the sample.
(tires, watches, clothes) over 5 years. Observations can be put in a
3-by-5 table or matrix whose rows correspond to the commodities and
columns to the years:
S =
  s11  s12  s13  s14  s15
  s21  s22  s23  s24  s25
  s31  s32  s33  s34  s35

Factor analysis seeks to express this table as the sum of two tables
M and J of similar dimensions, the first with constant rows and the
second with constant columns:

M =
  M_1  M_1  M_1  M_1  M_1
  M_2  M_2  M_2  M_2  M_2
  M_3  M_3  M_3  M_3  M_3

J =
  J_1  J_2  J_3  J_4  J_5
  J_1  J_2  J_3  J_4  J_5
  J_1  J_2  J_3  J_4  J_5
In a practical problem this cannot be done exactly, particularly if
several firms are involved. This is the familiar problem of conflicting
observations, which is treated in Sec. 7.2. In practice, some compromise
is found which gives the M and J that "fit best" the observations S.
A graphic way to express the problem of factor analysis is the
following. You are given a rectangular piece, say, 3 by 5 miles, of a
topographical map with contour lines showing the elevation at various
spots. You are supposed to find a landscape profile running from
north to south and another one running from east to west with the
property that, if you slide the bottom of the first perpendicularly along
the humps and bumps of the second, the top crests describe the original
surface of the 3-by-5 map. The same happens if you interchange the
roles of the two profiles. The two profiles are kept always perpendicular
to each other; and this is why the literature calls the two factors
M and J orthogonal (that is to say, right-angled). (See Fig. 23.)
Computing differences among the various entries in M and J is a
simple matter under the usual assumptions. Again, we minimize the
expression

Σ_{f,i,t} (S_fit - M_i - J_t)²

with respect to M_1, M_2, M_3, J_1, J_2, J_3, J_4, J_5. Thus the solution for
M_1 is

M̂_1 = (Σ_{f,t} S_{f1t})/FT - (Σ_t Ĵ_t)/T     (11-8)

and that of J_3 is

Ĵ_3 = (Σ_{f,i} S_{fi3})/FI - (Σ_i M̂_i)/I     (11-9)
where F, I, T are the total number of firms, industries, and time
periods, respectively. Note that, to estimate the degree of monopoly
in the first industry, we need knowledge of inflation in all years; to
estimate inflation in year 3, we need measures of monopoly for all
Fig. 23. Elevation profiles: north to south (3 miles) and west to east (5 miles).
industries. Equation (11-8) can be rationalized as follows: to estimate
the effect of monopoly in the first industry, disregard the price-cost
spread in all other industries, and compute the overall (firm-to-firm
and period-to-period) average spread in industry 1:

(Σ_{f,t} S_{f1t})/FT

From this deduct the average inflationary impact

(Σ_t Ĵ_t)/T

What is left is the monopoly impact.
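The recipe behind (11-8) and (11-9), row and column averages net of the other factor's average impact, amounts to fitting the additive model by least squares. A minimal sketch in modern notation follows; the spread figures are invented (and happen to be exactly additive, so the residuals vanish), there is a single firm so F = 1, and the normalization that folds the grand mean into M is my choice, since M and J are identified only up to an additive constant:

```python
# Hypothetical spreads S[i][t] for I = 3 industries over T = 5 years
# (one firm, so the sum over f is trivial); the numbers are invented.
S = [[17, 14, 13, 12, 11],
     [15, 12, 11, 10,  9],
     [14, 11, 10,  9,  8]]
I, T = len(S), len(S[0])

grand = sum(sum(row) for row in S) / (I * T)
row_mean = [sum(row) / T for row in S]
col_mean = [sum(S[i][t] for i in range(I)) / I for t in range(T)]

# Least-squares additive fit: fitted value = row mean + column mean - grand.
M_hat = row_mean                       # monopoly impacts (grand mean folded in)
J_hat = [c - grand for c in col_mean]  # inflation impacts

resid = [[S[i][t] - M_hat[i] - J_hat[t] for t in range(T)] for i in range(I)]
```

With real data the residuals would not vanish; the "compromise" mentioned in the text is exactly this least-squares fit.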
11.5. Testing orthogonality
It is entirely possible for inflation's impact on the price-cost spread
to be related to monopoly. Indeed, there is evidence from the Second
World War that price control was more successful in monopolistic
industries (and firms) than in competitive ones. A monopolist or
monopolistic competitor is recognized and remembered by the public.
If he takes advantage of inflation, he may lose goodwill or perhaps be
sued by the government as an example to others. If monopoly and
inflation interact in this way or in some other way, the linear model
(11-7) is not applicable. Because it is simple, however, we may adopt
it as our null hypothesis, fit it, and look for a systematic pattern of
discrepancy as a test of the hypothesis.
The formulas for doing this are rather complicated expressions,
which I shall not bother to state. Intuitively the test is quite simple.
If by rearranging whole rows and whole columns, table S can be made
to have its highest entry in the upper left-hand corner, its smallest
entry in the lower right-hand corner, with each row and column
stepping down by equal amounts, the null hypothesis holds. For
example,

S =
  12  15  11
  14  17  13

can be rearranged thus:

S' =
  17  14  13
  15  12  11

Note that

S' =
  15  12  11      2  2  2
  15  12  11  +   0  0  0     (11-10)
To state the same test in terms of our geographic profiles of Sec. 11.4:
Cut up the original map into north-south strips, rearrange, and then
glue them together. Then cut the resulting map into east-west strips
and rearrange these. Should this procedure produce a map of a territory
(1) sloping from its northwest corner down to its southeast corner,
(2) with neither local hills nor saddle points, and (3) such that, if you
stand anywhere on a given geographical parallel and take one step
south, you step down by an equal amount, say, 3 feet, and (4) such
that, likewise, if you start from any point on a fixed meridian, one
eastward step loses the same elevation, say, 2.1 feet, then the factors
are orthogonal.
In arithmetical terms, having estimated M̂_1, M̂_2, . . . , Ĵ_5, rearrange
the rows and columns so that the most monopolistic industry occupies
the top row and the most inflationary year occupies the leftmost
column. Compute the residuals

v̂_fit = S_fit - M̂_i - Ĵ_t

and place their sums Σ_f v̂_fit
in the appropriate row and column. Any run, or large local concentration,
of mostly positive or mostly negative residuals is evidence that
monopoly and inflation have interaction effects (are not orthogonal
factors).
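The residual check just described can be sketched directly. Both tables below are invented, and the run-counting rule (longest streak of like-signed residuals) is only one crude way to flag a "large local concentration":

```python
# Hypothetical 2 x 3 tables: observed spreads and an additive fit
# M_hat[i] + J_hat[t] obtained from an earlier step; numbers invented.
S      = [[18, 14, 12],
          [15, 12, 12]]
fitted = [[17, 14, 13],
          [15, 12, 11]]
I, T = 2, 3

resid = [[S[i][t] - fitted[i][t] for t in range(T)] for i in range(I)]
row_sums = [sum(r) for r in resid]                                  # per industry
col_sums = [sum(resid[i][t] for i in range(I)) for t in range(T)]   # per year

# Longest streak of same-signed (nonzero) residuals, read row by row.
signs = [v > 0 for row in resid for v in row if v != 0]
longest = run = 1
for a, b in zip(signs, signs[1:]):
    run = run + 1 if a == b else 1
    longest = max(longest, run)
```

A long streak, or a row or column sum far from zero, would be the kind of systematic pattern the text warns about.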
11.6. Factor analysis and variance analysis
Unspecified factor analysis, the technique explained in this chapter,
should be carefully distinguished from variance analysis (and
from factor analysis in the principal-components sense of the term).
Both techniques make use of a row-column classification, and both
usually proceed on the null hypothesis that rows and columns do
not interact. But here the similarities end. Factor analysis measures
the row and column effects for each row and column, i.e., it
computes the unspecified variable. Variance analysis attributes
various percentages of total variance 1 to differences among all rows, to
differences among all columns, and the remainder to chance. Factor
analysis ends with I + T estimates M̂_1, M̂_2, . . . , M̂_I, Ĵ_1, Ĵ_2, . . . , Ĵ_T.
Variance analysis ends with three percentages expressing row variance,
column variance, and unexplained variance in terms of total variance.
1 Total variance in terms of the example, model (11-7), is

Σ_{f,i,t} (S_fit - av S)² / FIT

where av S is av_{fit} S_fit, or the average spread over the entire sample.
In the course of analysis of variance, row means (14⅔ and 12⅔) and
column means (16, 13, and 12) are computed, but they are only auxiliary
quantities, not estimates of factor impacts. However, the
differences in these two sets of means are equal respectively to the
differences in the impact [(2 and 0) and (15, 12, and 11)] of the two
variables into which S' is factorable [see equation (11-10)].
It is not my intention to go into the details of variance analysis.
Just three comments about it:
1. The reason why people analyze variance and not the fourth or
seventeenth moment of the sample is this: A normal distribution with
zero mean (such as the error term v_fit) can be completely described by
its variance. The variance is a sufficient estimate, for it contains all
the information that is implicit in the assumed distribution.
2. Under orthogonality, row, column, and unexplained variances
add up to total variance, just as the square on the hypotenuse equals
the sum of the squares on the other sides of a right-angled (orthogonal)
triangle.
3. Under normality and orthogonality, variance ratios have certain
convenient distributions, which are suitable for testing the null
hypothesis (that rows or columns differ only by chance).
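Comment 2, the Pythagorean image, is an algebraic identity in a two-way table with one observation per cell. A small check in Python, on an invented table (dividing each sum of squares by FIT turns it into the variance of the footnote):

```python
# Invented 2 x 3 table of spreads; one observation per cell.
S = [[18, 14, 12],
     [15, 12, 12]]
I, T = len(S), len(S[0])
grand = sum(map(sum, S)) / (I * T)
row = [sum(r) / T for r in S]
col = [sum(S[i][t] for i in range(I)) / I for t in range(T)]

total = sum((S[i][t] - grand) ** 2 for i in range(I) for t in range(T))
rows  = T * sum((r - grand) ** 2 for r in row)          # row (monopoly) part
cols  = I * sum((c - grand) ** 2 for c in col)          # column (inflation) part
resid = sum((S[i][t] - row[i] - col[t] + grand) ** 2    # unexplained part
            for i in range(I) for t in range(T))
# Under the additive (orthogonal) decomposition: total = rows + cols + resid.
```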
Further readings
Harold W. Watts, "Longrun Income Expectations and Consumer Saving,"
in Studies in Household Economic Behavior, by Dernburg, Rosett, and Watts
(Yale Studies in Economics, vol. 9, pp. 103-144, New Haven, Conn., 1958),
makes judicious use of dummy variables.
Robert M. Solow, "Technical Change and the Aggregate Production
Function" (Review of Economics and Statistics, vol. 39, no. 3, pp. 312-320,
August, 1957), computes the unspecified variable "technology" not, as we
have done in Sec. 11.2, by interregional aggregation, but by using the marginal
productivity theory of distribution.
Variance analysis is a vast subject. See Kendall, vol. 2, chaps. 23 and 24.
CHAPTER 12
Time series
12.1. Introduction
A time series x(t) = [x(1), . . . , x(T)] is a collection of readings,
belonging to different time periods, of some price, quantity, or other
economic variable. We shall confine ourselves to discrete, consecutive,
and equidistant time points.
Like all the kinds of manifestations with which econometrics deals,
economic time series, both singly and in combination, are generated
by the systematic and stochastic logic of the economy. The same
techniques of estimation, hypothesis searching, hypothesis testing, and
forecasting that work elsewhere in econometrics work also in time
series.
Why then a chapter on time series? Why indeed, were it not for the
large amount of muddle and confusion we have inherited from many
decades of wellintentioned but faulty investigations.
The earliest and most abused time series are charts of the business
cycle and security market behavior. Desiring knowledge, business
cycle "physiologists" avoided all models, assumptions, and hypotheses
in the hope that the facts would speak for themselves. Pursuing
profit, stock market forecasters have sought and are seeking (and their
clients are buying) short cuts to strategic extrapolations; they have
cared nothing about the logic, whether of the economy or of their
methods. Their Economistry is the crassest of alchemies.
The key ideas of this chapter are these: Facts never speak for
themselves. Every method of looking at them, every technique for
analyzing them is an implicit econometric theory. To bring out the
implicit assumptions for a critical look, we shall study averages,
trends, indices, and other very common methods of manipulating data.
I do not mean to condemn the traditional approaches altogether.
Certainly, physiology and "mere" description can do no harm — for
ultimately they are the sources of hypotheses. To look for quick,
cheap, and simple short cuts to forecasting is a reasonable research
endeavor. Furthermore, modern machines can help by doing much of
the dull work, provided that an intelligent being is available to study
their output.
12.2. The time interval
Up to now I have carefully avoided any discussion of time. In the
model of Chap. 1

C_t = α + γZ_t + u_t     (12-1)

what does t = 1, 2, . . . , T represent, and why not select different
intervals?
The secret is that the time interval t, the parameters α and γ, the
variables C and Z, and the stochastic term u must be defined not
without thought of but with regard to one another. If the time
interval is short, then γ must be the short-run marginal propensity to
consume. If t is a year, then it makes sense for Z to be treated as
predetermined. As the time interval is shortened, more and more
variables change from predetermined to simultaneously determined.
With shorter and shorter time periods, the causes that generate the
random terms overlap more and more and invalidate the assumption of
serially independent random disturbances.
In certain cases we deliberately reduce the number of time intervals
of our data in order to bring time into agreement with the parameters
and stochastic assumptions. For example, if we are trying to estimate
a production or cost function and have hourly data for inputs and
outputs, we may lump these into whole working days; otherwise the
disturbances during the morning warm-up period, coffee break, lunchtime,
and the various peak fatigue intervals are not drawn from the
same Urn of Nature.
The smoothing of time series must be done with care. In the above
example, if the purpose is to make the random disturbance come from
the same Urn in each interval, then overlapping as well as nonover
lapping workdays will do. If we also want the disturbances to be
serially independent, then only nonoverlapping days should be used.
Digression on moving averages and sums
Moving averages differ from moving sums only by a constant
factor P equal to the number of original intervals smoothed
together.
If P is even (= 2N) the average or sum should be centered
on the boundary between intervals N and N + 1. If 2N + 1
intervals are averaged, center on the (N + 1)st.
There are many smoothing methods besides the unweighted
moving average. We are free to decide on the span P of the
moving average and on the weight to be given each position within
the span. Given P successive points, we may wish to fit to
them a least squares quadratic, logistic, or other curve. Every
particular curve implies a particular set of weights, and conversely.
Fitting a polynomial of degree Q through P points can
be approximated by taking the simple moving average of a simple
moving average of a simple moving average . . . enough times
and with suitable spans.
All this is straightforward and rather dull, unaccompanied
by theoretical justification. What makes moving averages
interesting is the claim that they can be used to determine and
remove the trend of a time series. We shall see in Sec. 12.8 how
dangerous a technique this is. As we shall see in Sec. 12.5, mov
ing averages give rise to broad oscillations where none exist in
the original series.
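The digression's rules can be sketched as a small routine; the series below is invented:

```python
def moving_average(x, P):
    """Unweighted moving average of span P.  Each value is the mean of
    P consecutive readings; for odd P = 2N + 1 it is centered on the
    (N + 1)st of them, for even P = 2N it belongs on the boundary
    between the Nth and (N + 1)st readings, a half-period offset."""
    return [sum(x[s:s + P]) / P for s in range(len(x) - P + 1)]

series = [3, 5, 4, 6, 8, 7, 9]          # invented readings
smoothed = moving_average(series, 3)    # [4.0, 5.0, 6.0, 7.0, 8.0]
twice = moving_average(smoothed, 3)     # smoothing a smoothing, as in the text
```

The two successive passes of span 3 are themselves equivalent to a single weighted moving average with weights 1/9, 2/9, 3/9, 2/9, 1/9, which illustrates the text's point that repeated simple averaging approximates weighted (curve-fitting) schemes.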
12.3. Treatment of serial correlation
The term serial conelation, or autocorrelation , means the noninde
pendence of the values u t and u t e of the random terms. The term
autorcgrcssion applies to values avand x t $ when cov (x t) x t o) ^ 0.
In this section we consider briefly (1) the sources of serial correlation,
(2) its detection, (3) the allowances and modifications, if any, that it
occasions in our estimating techniques, (4) the consequences of not
making these allowances and modifications.
Random terms are serially correlated when the time interval t is
too short, when overlapping observations are used, and when the data
from which we estimate were constructed by interpolation. Thus,
if, in

C_t = α + γZ_t + u_t
t measures months or weeks, then the random term has to absorb the
effects of the months' being different in length, weather, and holidays,
effects which are not random in the short period but which follow
a cycle of 365 days. If, however, t is measured in years, then all these
influences are equalized, one year with another, and u t loses some of
its autocorrelation. Similarly, if successive sample points are dated
"January to December," "February to January," "March to February,"
and so on, successive random terms are correlated at least 10/12
(10 being the number of months common to successive samples).
Frequently the raw materials of econometric estimation are constructed
partly by interpolation. For instance, there is a census in
1950 and in 1960. Annual sample surveys in 1951, 1952, . . . measure
births, deaths, and migrations; these data, cumulated from 1950,
should square with the census population figure of 1960. Since this
seldom happens, the discrepancy in the final published figures is
apportioned (in general, equally) among the several years of the decade.
The resulting annual figures for birth rate, etc., share equal portions
of a certain error of measurement and are, therefore, correlated more
than they otherwise would be. In a model that uses annual data on
the birth rate and assumes that it is measured without error, it is the
random term that absorbs the year-to-year correlation.
We shall illustrate with the simple model (12-1). There are
two ways to detect serial correlation. One is to maintain the null
hypothesis that none exists:

cov (u_t, u_{t-θ}) = 0     (12-2)

estimate the model on this assumption, and then check whether
m_{û_t û_{t-θ}} is near zero. The other way is to maintain that the random
disturbances do have a serial connection, such as

u_t = ζ_1 u_{t-1} + · · · + ζ_Θ u_{t-Θ} + v_t     (12-3)

(with v_t random and nonautocorrelated), and estimate the ζ's to see
whether they are significantly different from zero.
The first method is arithmetically easier, though a less powerful test.
This requires explanation. The likelihood function of our sample is
the same as (2-4):

L = (2π)^{-S/2} [det (σ_{uu})]^{-1/2} exp [-½ u′(σ_{uu})^{-1} u]

where u stands for the S successive random disturbances (u_1, u_2, . . . , u_S).
In maximizing L with respect to α and γ, we should get the greatest efficiency in
α̂ and γ̂ if we took account of the fact that σ_{uu} is no longer diagonal
when there is serial correlation among the disturbances. The null
hypothesis cov (u_t, u_{t-θ}) = 0, though it does not bias α̂ or γ̂ or make
them inconsistent, does nevertheless increase their sampling variances
and covariances. The û's are computed with the help of the inefficient
α̂ and γ̂ and are themselves inefficient estimates of the true disturbances.
Therefore m_{û_t û_{t-θ}} is an inefficient (i.e., overspread)
estimator of cov (u_t, u_{t-θ}) and provides a flabby test of serial correlation.
It does not reject the null hypothesis with so much confidence as a
more powerful test (i.e., one associated with a very pinched distribution
of m_{û_t û_{t-θ}}).
Instead of testing by m_{û_t û_{t-θ}} it is recommended that we compute
the expression

D(θ) = Σ_t (û_t - û_{t-θ})² / Σ_t û_t²

which happens to have convenient properties, which are of no concern
to the present discussion. It is easily seen that, if m_{û_t û_{t-θ}} = 0, then
D(θ) = 2. Large departures from this value indicate that the null
hypothesis is untrue.
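A sketch of a statistic of this form, the ratio of the summed squared θ-differences of the residuals to their summed squares; the simulated residuals are an assumption for illustration:

```python
import random

def D(resid, theta=1):
    """Sum of squared theta-differences over sum of squares; close to
    2 when the residuals are serially uncorrelated at lag theta."""
    num = sum((resid[t] - resid[t - theta]) ** 2
              for t in range(theta, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

random.seed(1)
u = [random.gauss(0.0, 1.0) for _ in range(5000)]   # white noise
d = D(u)                # near 2: no serial correlation

v = [0.0]
for _ in range(5000):   # strongly autocorrelated residuals
    v.append(0.9 * v[-1] + random.gauss(0.0, 1.0))
d_auto = D(v[1:])       # well below 2
```

For positively autocorrelated residuals the squared differences shrink and D falls below 2; for negatively autocorrelated ones it rises above 2.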
The second method for taking into account the serial correlation of
the random disturbance is more efficient than the first, but biased.
To see this, consider the special case
C_t = α + γZ_t + u_t     (12-4)

u_t = ζu_{t-1} + v_t     (12-5)
As good simultaneous-approach proponents, we combine the two
equations as follows:

C_t - ζC_{t-1} = α(1 - ζ) + γ(Z_t - ζZ_{t-1}) + v_t     (12-6)

and maximize the joint likelihood of the random disturbances v with
respect to the three parameters α, γ, ζ. Unfortunately, not only does
this lead to a high-order system of equations, but the maximum likelihood
estimates are biased. The reason for bias is the same as in
Chap. 3, namely, that (12-6) is a model of decay.
There is yet a third method, which is somewhat biased and somewhat
inefficient. First fit (12-4) by least squares, ignoring (12-5): this step
is inefficient. Then compute ζ̂ from (12-5) using the residuals
û of the previous step: this introduces the bias. Next construct
the new variables c_t = C_t - ζ̂C_{t-1}, z_t = Z_t - ζ̂Z_{t-1}, and fit by least
squares c_t = a + γz_t + w_t to get a new approximation to α and γ.
Repeat the cycle any number of times.
When several equations have autocorrelated error terms, this biased
second method always works in principle. The first and third methods
are dangerous to use because we know practically nothing about how
good Sm_{û_t û_{t-θ}}/(S - θ)m_{ûû} is as an estimator of the regression coefficient
of u_t on u_{t-θ}; nor do we know whether the cyclical procedure of the
third method converges.
Matters get rapidly worse the more complicated the dependence of
u_t on its past values.
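The three-step cycle of the third method can be sketched end to end on simulated data. Everything here is an assumption for illustration: the true parameter values, the simple least-squares helper, and the recovery of the intercept via a = a*/(1 - ζ̂), which presumes the constant of the quasi-differenced equation is α(1 - ζ) as in (12-6):

```python
import random

random.seed(0)
# Simulate C_t = alpha + gamma*Z_t + u_t with u_t = zeta*u_{t-1} + v_t.
alpha, gamma, zeta = 2.0, 0.8, 0.6        # invented true values
n = 400
Z = [float(t % 10) for t in range(n)]     # invented regressor
u, C = 0.0, []
for t in range(n):
    u = zeta * u + random.gauss(0.0, 0.5)
    C.append(alpha + gamma * Z[t] + u)

def ols(y, x):
    """Intercept and slope by ordinary least squares."""
    m = len(y)
    mx, my = sum(x) / m, sum(y) / m
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

a, g = ols(C, Z)                          # step 1: fit (12-4), ignoring (12-5)
for _ in range(5):                        # repeat the cycle a few times
    e = [C[t] - a - g * Z[t] for t in range(n)]
    z_hat = (sum(e[t] * e[t - 1] for t in range(1, n))
             / sum(ei * ei for ei in e[:-1]))        # step 2: zeta from (12-5)
    c = [C[t] - z_hat * C[t - 1] for t in range(1, n)]
    z = [Z[t] - z_hat * Z[t - 1] for t in range(1, n)]
    a_star, g = ols(c, z)                 # step 3: refit on the new variables
    a = a_star / (1 - z_hat)              # assumed intercept recovery
```

On this simulated sample the cycle settles near the true γ = 0.8 and ζ = 0.6; whether it converges in general is, as the text says, an open question.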
12.4. Linear systems
Most business cycle analysis proceeds on the assumption (sometimes
explicitly stated, more often not) that an economic time series x(t)
is made up of two or more additive components f(t), g(t), . . . called
the "trend," the "cycle," the "seasonal," and the "irregular." Trend,
cycle, and seasonal are supposed to be, in some relevant sense, rather
stable functions of time; the irregular is not. We shall use the expressions
"irregular," "random component," "error," and "disturbance"
interchangeably. The word "additive" signifies, as usual, lack of
interaction effects among the components. 1
In analyzing time series, the problem is to allocate the observed
1 See Sec. 1.11.
fluctuations in x to its unknown additive components:

x(t) = f(t) + g(t) + h(t) + u(t)     (12-7)

and to find the shapes of f, g, and h: whether they are straight lines,
polynomial or trigonometric functions, or other complicated forms.
As stated, the problem is indeterminate. The facts will never
tell us either how many additive terms the expression in (12-7)
should have or what shapes are best. As usual, we must maintain a
hypothesis — that the trend is, say, a straight line

f(t) = α + βt

that the cycle is some trigonometric function, e.g.,

γ sin (δ + εt)
and so forth, the problem being to estimate the Greek letters from data
or to see how well a given formulation fits in comparison with some
rival hypothesis.
Trigonometric functions can be approximated by lagged expressions,
such as

g(t) = γ_0 + γ_1 x(t - 1) + γ_2 x(t - 2) + · · · + γ_Q x(t - Q) + u_t     (12-8)

with appropriate coefficients. The term "linear" expresses the additivity
of the components of (12-7) or the linear approximation of
(12-8) or both. In this section and in several more, we shall consider
linear systems of a single variable x(t). Linearity in the second sense
(above) is very handy, because in linear systems the number of lags
in (12-8) and the values of the γ's determine whether g(t) oscillates,
explodes, or damps; the initial value g(0) determines only the amplitude
of the fluctuations. In nonlinear systems amplitude and type are not
separable in this way.
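For a second-order case the claim can be made concrete: the coefficients in x(t) = γ_1 x(t - 1) + γ_2 x(t - 2) fix the type of motion through the roots of λ² - γ_1 λ - γ_2 = 0, while the initial value scales only the amplitude. A sketch (the classification labels are mine):

```python
import cmath

def character(g1, g2):
    """Type of motion of x(t) = g1*x(t-1) + g2*x(t-2), read off the
    roots of lambda**2 - g1*lambda - g2 = 0: complex roots oscillate,
    modulus below one damps, modulus above one explodes."""
    disc = cmath.sqrt(g1 * g1 + 4.0 * g2)
    roots = [(g1 + disc) / 2, (g1 - disc) / 2]
    modulus = max(abs(r) for r in roots)
    shape = "oscillating" if any(abs(r.imag) > 1e-12 for r in roots) else "monotone"
    size = "explosive" if modulus > 1 else "damped"
    return shape + ", " + size

kinds = [character(1.0, -0.5),   # complex roots inside the unit circle
         character(0.5, 0.25),   # real roots inside the unit circle
         character(1.0, 0.5)]    # a real root outside the unit circle
```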
We shall devote Secs. 12.5 to 12.7 to a priori trendless systems; then,
in Sec. 12.8, we shall inquire how we know a system to be trendless
and, if it has a trend, how this trend can be removed.
12.5. Fluctuations in trendless time series
A trendless or a detrended time series can be random, oscillating, or
cyclical. It is random if it can be generated by independent drawings
from a definable Urn of Nature. It is cyclical (or periodic) if it repeats
itself perfectly every Ω time periods; it is oscillatory if it is neither
random nor periodic.
A simple trigonometric function like sin (2πt/Ω), or sin (2πt/Ω) + b
sin (4πt/Ω), is strictly cyclical. The combination of two or more trigonometric
functions with incommensurate 1 periods Ω_1, Ω_2, . . . [for
instance, x(t) = sin (2πt/Ω_1) + cos (2πt/Ω_2)] is not periodic but oscillatory.
The periods Ω_1, Ω_2, . . . appear in (12-7) only in
the trigonometric terms sin, cos, tan, etc., and not as multiplicative
factors, exponents, etc.
With the exception of purely seasonal phenomena (which are
periodic), economic time series are overwhelmingly of oscillating type.
Oscillations arise from three sources: (1) the summation of nonstochastic
time series with incommensurate periods, (2) moving
averages of random series, and (3) autoregressive systems having a
stochastic component.
We can briefly dispose of the last case first. If x(t) is an autoregressive
variable

x(t) = α_1 x(t - 1) + · · · + α_H x(t - H) + u_t     (12-9)

whose systematic part would damp if u were to be continually zero,
then x(t) can be expressed as a weighted moving average of the random
disturbances, and so the third case reduces to the second case above. 2
The moving average of a random series, however, oscillates! This
proposition, the Slutsky proposition, shocks the intuition at first and,
therefore, deserves some discussion. Let us take a time series so long
that we do not have to worry about any shortage of material to be
averaged by moving averages. Consider now a moving average
spanning P of the original periods. To facilitate the exposition, let
us take P amply large. Now the original series u(t), if it is random,
should itself be neither constant nor periodic. Because if it is constant,
it is not random. And if it is periodic, a given value of u depends on
the previous one; hence u(t) is not random. A truly random series is
neither full of runs and patterns nor entirely bereft of them. Just as a
true die, once in a while, produces runs of sixes or aces, so a random
1 Two real numbers are incommensurate when their ratio is not a rational
number.
2 See Kendall, vol. 2, pp. 406-407.
174 TIME SERIES
time series occasionally exhibits a run. For the sake of illustration,
suppose the run is 3 periods long and u_101 = u_102 = u_103 = 10. Now
consider what happens to its moving average in the neighborhood of the
run. Let the span be large relative to the run, say, P = 17. Then
the moving average has a run (less pronounced and more tapered)
19 periods long — that is to say, from the time that the right-hand end
of the span includes u_101 to the time that its left-hand end includes u_103.
A moving average of a moving average of a random series oscillates
even more.
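The run-lengthening arithmetic (a 3-period bump smeared over P + 3 - 1 periods of the average) can be checked directly; the series and seed are arbitrary:

```python
import random

random.seed(2)
u = [random.gauss(0.0, 1.0) for _ in range(300)]
u[100:103] = [10.0, 10.0, 10.0]       # the 3-period run of the text

P = 17
avg = [sum(u[s:s + P]) / P for s in range(len(u) - P + 1)]

# The average indexed by window-start s covers u[s], ..., u[s + P - 1],
# so it feels the run exactly when s + P - 1 >= 100 and s <= 102.
touched = [s for s in range(len(avg)) if s + P - 1 >= 100 and s <= 102]
width = len(touched)                  # 19 = P + 3 - 1, as the text says
```

The smoothed series thus carries a broad, tapered hump where the raw series had only a brief spike, which is the Slutsky effect in miniature.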
These simple properties are vital for the statistical analysis of
business cycles.
In the first place, the economic system itself operates somewhat like
a moving average of random shocks: consumers, businesses, governments
get buffeted around by random external and internal impulses,
such as weather, a rush of orders, a rash of tax arrears; the economy
takes most of these things in its stride; it does not adjust instantaneously
and completely to the shocks, but rather cushions and absorbs
them over considerably larger spans than their original duration.
The Slutsky proposition accounts for business oscillations as the result
of averaging random shocks.
In the second place, even if the economic system itself does no
averaging, statisticians do. The national income, price indexes, and
other data in all the fact books are averages or cumulants of one sort or
another, frequently over time. Such data would exhibit oscillations
even if the economy itself did not.
Finally, analysts who use the moving average technique (on otherwise
flawless data from an economy that is innocent of averaging)
either for detrending or for any other purpose may themselves introduce
oscillations into their charts and so generate a business cycle
where none exists.
12.6. Correlograms and kindred charts
According to the Slutsky proposition, if we want to analyze a time
series we shall be well advised to leave it unsmoothed and try some
direct attack.
It is natural to ask first whether a given trendless time series x(t) is
oscillating or periodic. In the nonstochastic case the question can be
quickly settled by the unaided eye, detecting faithful repetition of a
pattern, however complicated. In the stochastic cases the faithful
repetition is obscured by the superimposed random effects and their
echoes, if any.
Define serial correlation of order θ as the quantity

ρ(θ) = cov (x_t, x_{t-θ}) / [var x_t · var x_{t-θ}]^{1/2}

A correlogram is a chart with θ on the horizontal axis and ρ(θ) or its
estimate r(θ) on the vertical. A strictly periodic time series has a
periodic correlogram with always the same silhouette and the same
periodicity. If the former is damped, so is the latter. A moving
average of random terms has a damped (or damped oscillating) correlogram
of no fixed periodicity. A nonexplosive stochastic autoregressive
system like (12-9) has a correlogram that is a damped wave of constant periodicity.
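A correlogram point r(θ) translates directly into code; the sine series below is an assumption chosen so the answer is known in advance (for a series of period 12, r(12) is +1 and r(6) is -1):

```python
import math

def r(x, theta):
    """Sample serial correlation of order theta, one point of the
    correlogram; the full-sample mean and variance are used."""
    n = len(x)
    m = sum(x) / n
    cov = sum((x[t] - m) * (x[t - theta] - m)
              for t in range(theta, n)) / (n - theta)
    var = sum((xi - m) ** 2 for xi in x) / n
    return cov / var

x = [math.sin(2 * math.pi * t / 12) for t in range(240)]   # period 12
correlogram = [r(x, theta) for theta in range(1, 25)]
```

Plotting `correlogram` against θ would reproduce the undamped periodic silhouette the text describes for a strictly periodic series.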
Correlograms are not foolproof. They may or may not identify
correctly the type of model to which a given time series belongs. For
instance, if the random term in (12-9) is relatively large, the correlogram
of x(t) will compromise between the strictly periodic silhouette of the
exact autoregressive system α_1 x(t - 1) + · · · + α_H x(t - H) and the
nonperiodic silhouette of the cumulated random terms u_t + α_1 u_{t-1} +
· · ·. In general, it will neither damp progressively nor
exhibit any fixed periodicity. This is very unfortunate, because, from
a priori theory, we expect to meet such time series often in economics.
Business cycle and stock market analysts are often interested in turning
points in a series and in forces bringing about these turning points
rather than in the amplitude of the fluctuations. This leads naturally
to periodograms. To take an example from astronomy, imagine that
the time series x(t) measures the angle of Mars and Jupiter with an
observer on earth. We know this series to be analyzable into four
components: the revolutions of Earth, Mars, and Jupiter round the
sun plus the minor factor of the earth's daily rotation. Periodograms
are supposed to show, from evidence in the time series itself, the four
relevant periods Ω_1 = 365.26 days, Ω_2 = 687 days, Ω_3 = 11.86 years,
and Ω_4 = 24 hours. This is a relatively easy matter if the series is
nonstochastic, if we know beforehand that only four basic periods are
involved, or both. The composite series fluctuates and undergoes
accelerations, decelerations, and reversals occasioned by the movements
of its four basic components. All this is captured by the
formulas

A = Σ_s x(s) cos (2πs/Ω)     B = Σ_s x(s) sin (2πs/Ω)

where Ω is an unknown period. The periodogram is a chart with Ω
on the horizontal and S² on the vertical axis. The value S² = A² + B²
attains maxima when Ω takes on the values Ω_1, Ω_2, Ω_3, Ω_4. The technique
works fairly well if x(t) is indeed composed of periodic (trigonometric)
terms and a random component. It works very badly
when x(t) is autoregressive, because the echoes of past random disturbances
are of the same order of magnitude as the smaller periodic
components of x(t) = α_1 x(t - 1) + · · · + α_H x(t - H) and claim the
same attention as the latter in the formula for S². Like the correlogram,
the periodogram fails us where it is most needed, that is,
in the analysis of an economic time series which we know to be autoregressive
and stochastic though we know nothing about the number
and size of its Ωs.
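The formulas for A, B, and S² translate directly; the two-component test series below is an assumption chosen so that the true peak, at period 12, is known:

```python
import math

def S2(x, omega):
    """Periodogram intensity S^2 = A^2 + B^2 at trial period omega."""
    A = sum(x[s] * math.cos(2 * math.pi * s / omega) for s in range(len(x)))
    B = sum(x[s] * math.sin(2 * math.pi * s / omega) for s in range(len(x)))
    return A * A + B * B

# Invented series: a strong period-12 term plus a weaker period-30 term.
x = [math.sin(2 * math.pi * t / 12) + 0.5 * math.cos(2 * math.pi * t / 30)
     for t in range(600)]
best = max(range(4, 60), key=lambda om: S2(x, om))   # trial periods 4..59
```

Charting S2(x, Ω) against the trial periods Ω would show the two humps; `best` picks out the dominant one.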
12.7. Seasonal variation
The easiest periodic components to measure and allow for are those
tied to astronomy. We know that the cycle of rain and shine repeats
itself every 365 days, and we would naturally expect this to be reflected
in any time series having to do with swim suits, umbrellas, or number
of eggs laid by the average hen. The same is true of cycles imposed by
custom or by the state, for instance, the seven-day recurrence of
Sunday idleness, the Christmas rush, the preference of employees for
July holidays. In all these cases the period itself is known, although
it may be complicated by moving feasts, the varying number of days
in a month, and the occasional occurrence of, say, a short month
containing four Sundays plus Easter or a Friday the thirteenth. The
problem here is not to find the seasonal period but its profile.
It is one thing to recognize and measure the seasonal profile and
another to remove it. Sometimes we want to do the former, sometimes
the latter, depending on our purpose.
If the purpose is to forecast cycles and trends, it is a false axiom that
a seasonally adjusted series is a better series. The only time we are
justified in taking out seasonal fluctuations is when we believe that
businessmen know there is seasonality, expect it, and adjust to it in
a routine way, either consciously in a microeconomic way or in their
totality when many millions of their microeconomic decisions interact
to form the business climate. So, for forecasting purposes, it is
legitimate to wash out seasonal movements only when they are washed
out of the calculations of consumers and businessmen. If a seasonal
exists but people have not detected it, it should be left in. For
instance, if it were true that the stock market had seasonal properties
unknown to its traders, these should not be corrected for, because the
participants mistake these for basic trends and react accordingly.
Conversely if the relevant people think there is a seasonal when in
fact none exists, its imagined effect should be allowed for by the
forecaster of trends. Suppose, as an example, that the market believes
that the U.S. dollar falls in the summer relative to the Canadian and
rises in the winter. This imagined seasonal should be taken into
account in analyzing the significance of monthly or quarterly import
orders. To deseasonalize every time series may increase knowledge in
all cases, but it increases forecasting accuracy only when the time has
come when the market has learned all the real seasonals and imagines
none where none exist.
Every formula either for measuring seasonals or for removing them
is an implicit economic theory, which may be appropriate for one
economic time series and inappropriate for another. For instance,
treating the seasonal as an additive factor implies that a given absolute
deviation from some normal or trend is equally important in all
months. This is false in the case of, say, housing-construction starts
in Labrador; the average number of these is, let us assume, 4 in
December and 50 in July. Then 5 starts in December is a more
serious departure than 51 in July. However, many analysts use
additive seasonals for each and every time series.
If the genuine seasonal period is 12 months, its profile can be
approximated by averaging the scores of several Januaries, then several
Februaries, etc. This technique gives a biased estimate of the seasonal
profile if the time series is autoregressive, unless random disturbances
12 months apart are independent. To see this, take (for simplicity
only) the one-lag autoregressive model
x(t) = αx(t − 1) + γ sin (2πt/12) + u_t
and let 0 represent the first January and 12 the following one. For
simplicity, let us average the values x(0) and x(12) of just two Januaries.
Then we have
x(12) = α^12 x(0) + α^11 u_1 + α^10 u_2 + · · · + α u_11 + u_12 + γ sin 2π
which involves a moving sum of random terms, and this sum oscillates,
as we already know from Sec. 12.5. The oscillation due to the random
term will be confounded with the amplitude of the true seasonal.
This will manifest itself in two ways: either the seasonal will seem to
shift or, if it does not shift, it will contain the cyclical properties of the
cumulated random effects.
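This bias is easy to exhibit numerically. The sketch below (Python; the coefficient values, seeds, and function names are my own illustrative choices, not from the text) simulates the one-lag model above and computes the naive month-by-month profile for two independent samples; the two estimated seasonals differ visibly, because the cumulated random terms contaminate each monthly average.

```python
import math
import random

def simulate(alpha=0.7, gamma=1.0, n_years=20, seed=0):
    """Simulate x(t) = alpha*x(t-1) + gamma*sin(2*pi*t/12) + u_t."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for t in range(12 * n_years):
        x = alpha * x + gamma * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 1)
        series.append(x)
    return series

def monthly_profile(series):
    """Naive seasonal profile: average all Januaries, then all Februaries, ..."""
    return [sum(series[m::12]) / len(series[m::12]) for m in range(12)]

# Two samples from the same process yield noticeably different "seasonals":
# the profile seems to shift, exactly as described above.
profile_a = monthly_profile(simulate(seed=1))
profile_b = monthly_profile(simulate(seed=2))
```

With a longer sample the two profiles converge, but for series of the length usual in economics the shift is of the same order as the seasonal itself.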
12.8. Removing the trend
Ultimately, economic theory and not the facts tell us whether the
trend (or longest-term movement) is linear or otherwise. If we obtain
the trend as what is left after cycles and seasonals have been taken out,
the trend inherits all the diseases and pitfalls of the seasonals.
In particular, if we use a moving average to obtain the trend, we are
almost certain to get it wrong. To see this, suppose that we have a
trendless cyclical and stochastic phenomenon, say
x_t = sin (2πt/θ) + u_t
depicted in Fig. 24. If the span P of the moving average is longer
than the true period θ, then the moving average (dashes in Fig. 24)
exaggerates the oscillations and imposes a long wavy trend where none
existed. Or again, if the system is autoregressive and trendless,
x(t) = α_1 x(t − 1) + · · · + α_H x(t − H) + u_t
the moving average of the random term contributes its oscillations to
the systematic ones and, by the same process as that shown in Fig. 24,
imposes a long, wavy trend. Naturally, distortions like these arise also
when x(t) truly contains some systematic trend. Moving averages
distort both the trend and the cycles.
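The distortion can be reproduced in a few lines. In this sketch (Python; the period, span, and sample length are arbitrary choices of mine) a centered moving average is applied to a perfectly trendless sine wave; the smoothed series still oscillates, that is, the moving average manufactures a wavy "trend" where the true trend is identically zero.

```python
import math

def moving_average(xs, span):
    """Centered moving average with an odd window length `span`."""
    h = span // 2
    return [sum(xs[i - h:i + h + 1]) / span for i in range(h, len(xs) - h)]

# A trendless cycle x_t = sin(2*pi*t/theta) with period theta = 12.
theta = 12
xs = [math.sin(2 * math.pi * t / theta) for t in range(120)]

# Span P = 17 > theta: the smoothed series is not flat; it still
# oscillates with substantial amplitude, imposing a wavy "trend".
ma = moving_average(xs, 17)
amplitude = max(abs(v) for v in ma)   # noticeably greater than zero
```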
The variate difference method eliminates trends on the ground that
any trend can be approximated by a polynomial of some degree N
and that such a polynomial can be brought down to zero after N + 1
differencings. Therefore, let

x(t) = γ_0 + γ_1 t + · · · + γ_N t^N + f(t) + u_t      (12-10)
where f(t) and u_t are the cyclical and random factors. The method
Fig. 24
proceeds as follows:
1. Difference (12-10) once:

x(t − 1) = γ_0 + γ_1(t − 1) + · · · + γ_N(t − 1)^N + f(t − 1) + u_{t-1}      (12-11)
2. Subtract (12-11) from (12-10) and call y(t) the new variable
x(t) − x(t − 1). We do not need to write out y(t) in full but note
only that its trend is a polynomial of one degree less than the poly-
nomial in (12-10) and that its random component is

v_t = u_t − u_{t-1}      (12-12)
3. Do the same for y(t), and define z(t) = y(t) − y(t − 1); this too
reduces the power of the trend and generates a random component

w_t = v_t − v_{t-1} = u_t − 2u_{t-1} + u_{t-2}      (12-13)
4. Continue in this fashion as long as the estimated covariances
m_xx, m_yy/2, m_zz/6 decrease. (The correcting denominators are dis-
cussed below.)
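The four steps can be condensed into a few lines of code. This is my own minimal rendering (Python), not the book's computing procedure; the correcting denominators generalize as C(2k, k) = 1, 2, 6, 20, . . . for the k-th difference. Applied to a pure second-degree trend, the corrected variance drops to zero after two differencings:

```python
from math import comb

def difference(xs):
    """First differences: y(t) = x(t) - x(t - 1)."""
    return [b - a for a, b in zip(xs, xs[1:])]

def mean_square(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

def variate_difference(xs, max_order=4):
    """Variance of each successive difference, divided by the
    correcting denominator C(2k, k): m_xx, m_yy/2, m_zz/6, ..."""
    out = []
    for k in range(max_order):
        out.append(mean_square(xs) / comb(2 * k, k))
        xs = difference(xs)
    return out

# A pure second-degree trend is killed by two differencings:
m = variate_difference([0.5 * t ** 2 for t in range(30)])
```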
To see what is going on, consider the first quadrant of Fig. 25, where
x(t) was taken to be a second-degree polynomial of t. Then y(t) is a
sloping straight line, and z(t) is a level one. The variance of x(t) is
quite high, because x assumes many widely different values as t changes.
The variance of y(t) is smaller, because, though y varies, it varies more
smoothly than x. And z does not vary at all. The variate difference
method reduces the trend to z, and any remaining variation in the
resulting series must be due to nontrend components.
Several things are wrong with this method. First, if x extends to
the second quadrant of Fig. 25, say, symmetrically, its covariation
with its lagged values may be very small or even zero. And, in general,
Fig. 25
a highdegree polynomial, because it twists and turns up and down,
may exhibit a smaller lag covariance than a lowdegree polynomial.
Hence we should faithfully carry on successive differencing in spite of
a drop in the series m_xx, m_yy, m_zz. But suppose we do. How are we to
tell when the polynomial of unknown degree has finally died down?
For meanwhile, as (12-12) and (12-13) show, we are performing
moving averages of the cyclical component and, for all we know, this
component may increase or decrease. Finally, the variate difference
method cannot come to any stop if its cyclical component has a short
lag. For instance, the first differences of 1, — 1, 1, — 1, . . . are
2, −2, 2, −2, . . . , and the first differences of the latter are 4, −4,
4, —4, and so on.
Now a word about the correcting denominators. If u_t itself is serially
uncorrelated, then, from (12-12), the variance of v_t is twice that of u_t,
since

var v_t = cov (u_t, u_t) − 2 cov (u_t, u_{t-1}) + cov (u_{t-1}, u_{t-1})
        = cov (u_t, u_t) + cov (u_{t-1}, u_{t-1}) = 2 cov (u_t, u_t)
Similarly,

var w_t = var (u_t − 2u_{t-1} + u_{t-2})
        = var u_t + 4 var u_{t-1} + var u_{t-2} = 6 var u_t

and so on for higher-order differences.
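A quick simulation confirms the two denominators (Python; the sample size and seed are arbitrary choices of mine):

```python
import random

rng = random.Random(42)
u = [rng.gauss(0, 1) for _ in range(100_000)]   # serially uncorrelated u_t

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v = [u[t] - u[t - 1] for t in range(1, len(u))]                 # v_t
w = [u[t] - 2 * u[t - 1] + u[t - 2] for t in range(2, len(u))]  # w_t

ratio_v = var(v) / var(u)   # close to 2, the denominator for m_yy
ratio_w = var(w) / var(u)   # close to 6, the denominator for m_zz
```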
In my opinion, all these methods for detecting or eliminating the
trend have serious imperfections. The way out is, as usual, to specify
the algebraic form of the trend, the number of cyclical components
acting on it, to make stochastic assumptions, and to maximize the
likelihood of the sample. The procedure is very laborious; it is
generally biased, but efficient. I think it represents the best we can
ever do, and I am condemning the other methods only if they are
pretentiously paraded as scientific. I do admit them as approximations
to the ideal.
12.9. How not to analyze time series
The National Bureau of Economic Research has attracted a great
deal of attention with its largescale compilation and analysis of
business cycle data. The compilation is done with such care, tenacity,
and love as to earn the gratitude of all users of statistics. The analysis,
however, has often been questioned. It proceeds roughly as follows:
1. Define a reference cycle for all economic activity. This is a
conglomerate of the drift of several time series, accorded various
degrees of importance.
2. Remove seasonal variations from the given series, say, carloadings
or business failures.
3. Divide the given series into bands corresponding to the reference
cycle.
4. Within each band express each January reading as a per cent of
the average January in the band, and so on to December.
5. In each of the resulting specific cycles recognize nine typical
positions or phases. The latter may be widely spaced, like an open
accordion, in a long specific cycle or tightly in a short one. The result
is now considered to be the business cycle in carloadings, and constitutes
the raw material for forecasting, for computing the irregular effects,
and for checking whether the given series can be said to have its typical
periodicity, amplitude, etc. There are variations of the procedure,
some ad hoc. After what I said earlier in the chapter about the
pitfalls of time series, I shall not make any further comment on the
National Bureau's method. Recently, electronic computations have
been programmed, mainly for removing the seasonal. 1 As they
involve the use of several layers of moving averages, they are not
altogether safe in the hands of an analyst ungrounded in mathematical
statistics; since, however, the seasonal is the least likely to cause harm
(after all, the period is correct), we may set this question aside.
12.10. Several variables and time series
In Secs. 12.1 to 12.9 we have considered variables that move in
time subject to shocks and to laws of motion unconnected with any
other variables. It hardly needs stressing that endogenous economic
variables are not of this kind, since all of them are generated jointly by
the workings of the economic system. One wonders of what use is the
analysis of individual time series despite the heavy apparatus of
correlograms, periodograms, and variate differences.
Assuming that several economic variables hang together structurally,
what kinds of time series do they manifest? Sections 12.11 and 12.12
discuss this problem. If several economic variables are unconnected,
how does a given combination of them behave? The answer to this
question (Sec. 12.13) provides a null hypothesis for judging the
effectiveness of averages, sums, and a variety of business indicators, like
the National Bureau of Economic Research "cyclical indicators" and
"diffusion indexes" (Sec. 12.13). The converse problems are also of
great importance to the progress of business cycle research, because
consideration of individual time series may enable us to infer the nature
of the economic system without laboriously estimating each structural
equation by the methods of Chaps. 1 to 9.
1 See Julius Shiskin, Electronic Computers and Business Indicators (Occasional
Paper 57, New York: National Bureau of Economic Research, 1957).
12.11. Time series generated by structural models
What kinds of time series are generated when the two variables x
and y are structurally related? We shall take up this question first for
nonstochastic relations and then for stochastic relations under various
simplifying assumptions. All our models will be complete.
If the model is completely nonlagged, like the usual skeleton business
cycle model
C = α + βY
Y = C + I      (12-14)
with investment taken as exogenous, then the time series for consumption
and income have the same shape as the series for investment, as
can be seen from the reduced form:
(1 − β)C = α + βI
(1 − β)Y = α + I
In this example the agreement is not only in the timing of turns but
in the phase as well, because investment, consumption, and income are
positively related. In a more extended model
C = α + βY
I = γ + δG      (12-15)
Y = C + I + G
where investment is endogenous, government expenditure is exogenous,
and investment is discouraged by the latter (δ < 0), all time series will
coincide on timing; but when G grows I falls, and C and Y will fall if δ
is less than −1 and will rise if it is greater.
If (12-14) and (12-15) are made stochastic, all endogenous variables
absorb some of the random disturbances. The random disturbances
apportion themselves, one year with another, according to a fixed
pattern among the endogenous variables. For instance, if u is the
random disturbance of the consumption function and v of the investment
function, the reduced form of (12-15):

(1 − β)C = (α + βγ) + (β + βδ)G + u + βv
(1 − β)I = (γ − βγ) + (δ − βδ)G + (1 − β)v      (12-16)
(1 − β)Y = (α + γ) + (1 + δ)G + u + v
shows that the fluctuations and irregular components are in step,
though they may differ in their amplitudes.
The variances of the three irregular components in (12-16) are
proportional respectively to σ_uu + 2βσ_uv + β^2 σ_vv, (1 − β)^2 σ_vv, and
σ_uu + 2σ_uv + σ_vv. Thus, if the two random terms are positively
correlated (σ_uv > 0), income wobbles more than consumption and
consumption more or less than investment, depending on the size of
the marginal propensity to consume.
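The reduced form (12-16) and these variances can be verified by simulation. In the sketch below (Python) all parameter values are invented for illustration, and the two shocks are drawn independently, so that σ_uv = 0 and the three variances of the irregular components should come out near 1 + β², (1 − β)², and 2 respectively:

```python
import random

alpha, beta, gamma_, delta = 10.0, 0.6, 5.0, -0.5
G = 20.0                      # exogenous expenditure held fixed
rng = random.Random(7)

Cs, Is, Ys = [], [], []
for _ in range(50_000):
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)   # independent unit shocks
    I = gamma_ + delta * G + v                # investment function
    Y = (alpha + gamma_ + (1 + delta) * G + u + v) / (1 - beta)
    C = alpha + beta * Y + u                  # consumption function
    Cs.append(C); Is.append(I); Ys.append(Y)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

var_C = (1 - beta) ** 2 * var(Cs)   # near 1 + beta**2 = 1.36
var_I = (1 - beta) ** 2 * var(Is)   # near (1 - beta)**2 = 0.16
var_Y = (1 - beta) ** 2 * var(Ys)   # near 2
```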
Let us now consider as a recursive model the market for fish. The
men go to their boats with today's price in their minds, expecting it to
prevail tomorrow, and work hard if the price is high. Thus tomorrow's
supply depends on today's price plus weather (z). Should the price
fall, the fishermen don't put the fish back into the sea; so at the end of
the day all the fish is sold. Demand is ruled by current price only.
d = α + βp + u
s = γ + δp_{-1} + εz + v      (12-17)
s = d
The model can be solved for p as follows:
α + βp_t + u_t = γ + δp_{t-1} + εz_t + v_t
which shows that price tends to zigzag (β negative, δ positive), falling
with good weather and rising with bad, as we might expect. In
(12-17), unlike case (12-16), the irregular components of the price and
quantity time series are no longer constant multiples of each other, nor
are they in step. This is so because randomly overeager demand
(u > 0) affects not only today's price but, through its effect on the
fishermen's efforts, contributes to a fall in tomorrow's price as well.
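The zigzag shows up already in the deterministic skeleton of (12-17). In this sketch (Python; the coefficients are invented, with β negative and δ positive) successive deviations of the market-clearing price from its equilibrium alternate in sign:

```python
alpha, beta = 100.0, -2.0    # demand: d = alpha + beta*p
gamma_, delta = 10.0, 1.0    # supply: s = gamma + delta*p_{-1}

p_star = (gamma_ - alpha) / (beta - delta)   # equilibrium price (= 30 here)
prices = [p_star + 5.0]                      # start above equilibrium
for _ in range(10):
    # market clearing d = s pins down today's price from yesterday's:
    prices.append((gamma_ - alpha + delta * prices[-1]) / beta)

# deviations alternate in sign, shrinking by the factor delta/beta = -0.5
deviations = [p - p_star for p in prices]
```

Since |δ/β| < 1 here, the zigzag damps toward equilibrium; with |δ/β| > 1 it would explode, the familiar cobweb alternative.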
The connections among phase, amplitude, and irregularity in
structurally related time series become very complicated as we increase
the number of variables and as we admit more and more lags and cross
lags. In any representative set of economic time series it would
indeed be a marvel if closely similar patterns emerged, except between
such series as sales of left and of right shoes. And yet the marvel
seems to happen.
12.12. The overall autoregression of the economy
Regardless of which came first, chickens and eggs in the long run
have similar time series, because there can be no chicken without a
previous combination of egg and chicken and there can be no egg
without a previous chicken. Since the hatching capacity of a hen is
fixed, say, 10 chicks per hen, and since the chickenproducing capacity
of an egg is also fixed, say 1 to 1, the cycles in the egg population and
in the hen population cannot possibly fail to exhibit a likeness — though,
in particular shortrun instances, random disturbances like Easter or a
fox can grievously misshape now the one, now the other series. Orcutt
has claimed 1 that something like this is true of the time series of the
economy's endogenous variables. He states that the autoregressive
relation
x_{t+1} = 1.3x_t − 0.3x_{t-1} + u_{t+1}
fairly describes the body of variables used by Tinbergen in his pioneering
analysis of American business fluctuations. 2 Orcutt's result, if
correct, would not exactly spell the end of structural estimation of
econometric models, because the latter may be more efficient, less
biased, etc. However, if a correct autoregression were discovered, it
would certainly shortcircuit a good deal of current research.
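Orcutt's relation is easy to reproduce numerically. In this sketch (Python; the simulation length and seed are mine) the characteristic equation z² − 1.3z + 0.3 = 0 has roots 1.0 and 0.3, so the process is in effect a random walk whose increments follow a first-order autoregression; pure noise cumulates into long, smooth, cycle-like swings:

```python
import random

def orcutt(n=500, seed=3):
    """Simulate x_{t+1} = 1.3*x_t - 0.3*x_{t-1} + u_{t+1}."""
    rng = random.Random(seed)
    xs = [0.0, 0.0]
    for _ in range(n):
        xs.append(1.3 * xs[-1] - 0.3 * xs[-2] + rng.gauss(0, 1))
    return xs

# Roots of z**2 - 1.3*z + 0.3 = 0: one root sits on the unit circle.
disc = 1.3 ** 2 - 4 * 0.3
roots = ((1.3 + disc ** 0.5) / 2, (1.3 - disc ** 0.5) / 2)   # (1.0, 0.3)
```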
Orcutt's theorem holds only for systems whose exact part, by itself,
is stable and nonexplosive. Orcutt also found that we can get better
estimates of the overall autoregression if we consider many time series
simultaneously than if we consider them one at a time. This follows
from the fact that Easter and foxes descend on eggs and hens
independently, so that a grievous random dent in the egg population tends
to be balanced by the relative regularity of the hen population.
In the absence of random shocks, all the interdependent variables
have the same periodicity but different timing, amplitudes, and levels
about which they fluctuate. With random shocks, the periodicities
are destroyed more or less depending on the severity of the shocks and
their incidence on particular variables. The unaided eye can seldom
1 Reference is in Further Readings at the end of the chapter.
2 Jan Tinbergen, Statistical Testing of BusinessCycle Theories (Geneva: League
of Nations, 1939).
recognize the true periodicity. A highly sophisticated technique can
screen out the autoregressive structure by combining observations
from all time series, but it is so difficult to compute that one might as
well specify a model in the ordinary way.
Exercises
12.A Let g(t) and k(t) be the population of gnus and of kiwis.
Let β_i and δ_i be the age-specific birth and death rates for gnus and α_j
and γ_j for kiwis. Disregard the question of the sexes. Let ε and ζ
stand for input-output coefficients expressing the necessary number
of kiwis a gnu must eat to survive, and conversely. Construct a
model of this ecological system. Do something analogous for new
cars and used cars.
12.B In a Catholic region, say, Quebec, the greater the number
of priests and nuns, other things being equal, the smaller the birth
rate, because the clergy is celibate. But the more numerous the
clergy, other things being equal, the higher the birth rate of the laity,
because of much successful preaching against birth control. Construct
an ecological model for such a population.
12.C The more people, the more lice, because lice live on people.
But the more lice, the more diseases and, hence, the fewer people.
Construct the model, with suitable life spans for the average louse
and human.
12.D According to the beliefs of a primitive tribe, lice are good for
one's health, because they can be observed only on healthy people.
(Actually the lice depart from the sick person because they cannot
stand his fever.) Construct this model and compare with Exercise 12.C.
12.13. Leading indicators
An economic indicator is a sensitive messenger or representative of
other economic phenomena. We search for indicators in the same
spirit in which pathology examines the tongue and measures the pulse:
for quickness, cheapness, and to avoid cutting up the patient to find
out what is wrong with him.
A timing indicator is a time series that typically leads, lags, or
coincides with the business cycle. Exactly what this means will
occupy us later. We shall deal only with the leading indicators.
From what was said in Sec. 12.12, it comes as no surprise that certain
economic time series, like Residential Building Contracts Awarded,
New Orders for Durable Goods, and Average Weekly Hours in
Manufacturing, should have a lead over Disposable Income, the Consumer
Price Index, and so forth. The difficult questions are (1) how to
insulate the cyclical components of each series from the trend, seasonal,
and irregular; (2) how to tell whether leads in the sample period are
genuine rather than the cumulation of random shocks; and (3) where
phases are far apart, how to make sure that carloadings lead disposable
income and not conversely, or that the Federal discount rate does lead
and direct the money supply and not try belatedly to repair past
mistakes. I am sure that, ultimately, one has to fall back on economic
theory; one is forced to specify bits and pieces of any autoregressive
econometric model, because no amount of mechanical screening of the
time series themselves can answer the third question convincingly.
In 30 years of research the National Bureau of Economic Research
has isolated about a dozen fairly satisfactory leading indicators out of
800odd time series. 1 I think, however, that in nearly all cases,
a priori considerations would have led to the selection of these leading
series without the laborious wholesale analysis of hundreds and
hundreds of time series. For instance, Average Hours Worked in
Manufacturing is a good candidate for leading indicator of
manufacturing activity because we know from independent observation
that it is easier for a business establishment to take care of a moderate
increase in orders by overtime than by hiring new workers and easier to
tide over a lull by putting its workers on short time than by laying some
off at the risk of losing them. All the sensible leading indicators
thrown up in the National Bureau's screening are obvious in a similar
way. An oddity like the production of animal tallow, which is said
to lead better than many other series, could not have been discovered
by a priori reasoning, but neither is it used by any sane forecaster, for a
good empirical fit is no substitute for a sound reason.
Part of the findings of the National Bureau are, I think, tautological,
because the timing indicators lead, lag, and coincide not with each
other individually, but with the reference cycle, which is an index of
1 See Geoffrey H. Moore, Statistical Indicators of Cyclical Revivals and Recessions
(Occasional Paper 31, New York: National Bureau of Economic Research, 1950),
particularly chap. 7 and appendix B.
"general business activity." The latter is a vague conglomerate of
employment, production, price behavior, and monetary and stock
market activity; therefore, it is no wonder at all that some series lead,
some coincide with, and others lag behind it. The reference cycle is a
useful summary, but we should not be misled into existential fallacies
about it.
12.14. The diffusion index
A diffusion index is a number stating how many out of a given set of
time series are expanding from month to month (or any other interval).
Diffusion indexes can be constructed from any set of series whatsoever
and according to a variety of formulas, of which I shall discuss just
three.
There are two reasons why one might want to construct a diffusion
index. One is the belief that a business cycle starts in some corner of
the economic system and propagates itself on the surrounding territory
like a forest fire. This says in effect that the diffusion index is a cheap
short-cut autoregressive econometric model. The second reason is
that the particular formula used to construct the index captures in a
handy way the logic of economic behavior.
Three different formulas have been suggested for the diffusion index:
Formula A Per cent of the series expanding
Formula B Per cent of the series reaching turns
Formula C Average number of months the series have been
expanding
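The three formulas can be made concrete in code. The rendering below (Python) is my own reading of them, not an official procedure; in particular the turn and run-length conventions are simplified, and ties are ignored:

```python
def formula_A(series_set, t):
    """Formula A: per cent of the series expanding from t-1 to t."""
    ups = sum(1 for s in series_set if s[t] > s[t - 1])
    return 100.0 * ups / len(series_set)

def formula_B(series_set, t):
    """Formula B: per cent of the series at a turn (peak or trough) at t."""
    turns = sum(1 for s in series_set
                if (s[t - 1] < s[t] > s[t + 1]) or (s[t - 1] > s[t] < s[t + 1]))
    return 100.0 * turns / len(series_set)

def run_length(s, t):
    """Signed number of periods s has been moving in its current direction."""
    up = s[t] > s[t - 1]
    k = 1
    while t - k >= 1 and (s[t - k] > s[t - k - 1]) == up:
        k += 1
    return k if up else -k

def formula_C(series_set, t):
    """Formula C: average number of periods the series have been expanding."""
    return sum(run_length(s, t) for s in series_set) / len(series_set)
```

A component that expands for three periods and then turns down registers 3, then −1 under formula C, which is what gives that formula its heavy weight on reversals.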
Research by exhaustion argues that we ought to try all these
formulas on all time series and choose the formula that gives
pragmatically the best results. This can be done quite cheaply on the
Univac. I think such a procedure will frustrate our search for good
indicators, because each formula embodies a different theory of
economic behavior, not universally suitable.
Formula A is justified by the classical type of business cycle, where
income, employment, prices, hours, inventories, production, and so on,
and their components move up and down in rough agreement or with
characteristic lags. Suppose, however, that the authorities control
totals — employment, some price index, credit, or the balance of
payments. The result is "rolling readjustment" rather than cycles.
Formula A has lost its relevance. In a world of rolling readjustment
this formula will show an uneventful record and will not be able to
indicate, much less predict, sectional crises hiding under a calm total.
Formula B is justified if consumers and business are more sensitive to
turns, however mild, than to accelerations, however violent. Investment
plans are likely to be of this kind. As long as there is expansion
in demand, any overexpansion will be made good eventually. If there
is contraction, however small, the mistake is more obvious, and panic
may easily result. On the other hand, there are many areas in both
the consumer and the business sectors where small turns are not taken
seriously. Formula B, therefore, can be used to best advantage in
studying certain investment series (like railways' orders of rolling
stock) but is counterprescribed elsewhere.
Formula C gives great emphasis to reversals. Take a component
that has been slowly expanding for some months, then turns down
briefly. Formula C registers (for this component) 1, 2, 3, etc., up to a
large positive number, then −1 (for the first month of contraction).
The more sustained the expansion, the more violently does the formula
register a halt or small reversal. This formula, then, is appropriate
where habit and momentum play an important part. Where could
we possibly want to apply it?
Hire-purchase may be related to disposable income in some way
that agrees with the logic of formula C. Suppose that small increases
in income go into down payments and time payments for more and
more gadgets; if so, a small fall in income would put a complete stop
to new hire-purchase, because the family would continue the contractual
time payments on the old gadgets and would not be likely to cut
into food, clothing, and recreation to buy new gadgets. This is a
theory of consumer behavior, and formula C is a convenient way to
express it short of an econometric equation.
Exercises
12.E Construct diffusion indexes by each formula from the two
time series below:
Series 1: 100, 96, 90, 96, 97, 95, 97
Series 2: 100, 99, 102, 100, 100, 101, 10?
and compare the cyclical behavior of the indexes with that of the
sum of the two series.
12.F In Exercise 12.E, series 2, replace the 99 by 101, and construct
formula A. Must turning points in the sum be preceded by
turning points in the index?
12.G Construct an example to show that an index according to
formula C can be completely insensitive to the sum of the component
series.
12.H Show by example the converse of Exercise 12.G, namely,
that swings in the diffusion index formula C need not herald turns
(or any change whatsoever) in the sum of the component series.
12.15. Abuse of long-term series
One unfortunate byproduct of time series analysis is that it requires
long time series with which to work, and several research organizations
have responded enthusiastically to the challenge.
For example, I have heard urgings that we construct Canadian
historical statistics for the purpose of sorting out timing indicators, on
the ground that what took the National Bureau 30 years can now be
done in 30 hours electronically. I think this kind of work quite futile,
for a few moments' reflection will convince us of its negative results.
The Canadian economy, compared with the American, is small and
relatively unbalanced; therefore, Canadian historical statistics will
have a very large irregular component, which will overwhelm the fine
structural relationships we want to uncover. The Canadian economy,
being open, responds to impulses from abroad; therefore, even if we
had good domestic historical time series, our chances of finding among
them good indicators are slim. We also know that the Canadian
economy is "administered" (it has more governments per capita than
we have and more industrial concentration); so the developments that
are foreshadowed by the indicators are likely to be anticipated by the
big policy makers, with the result that predictions go foul. We know
that Canada is and will be growing fast and that the past (on which all
indicators rely) will not be a dependable guide.
My guess is that the earliest useful year for time series on bread
baking is somewhere around 1920. For iron ore shipments it is 1947,
the year when certain Great Lakes canals were deepened. However,
for housing demand as a function of family formation, many decades
or even centuries might prove to contain valid information.
There are many good reasons why we might want to construct
uniformly long historical statistics, but certainly the needs of cycle
forecasting are not among them.
12.16. Abuse of coverage
An unfortunate byproduct of diffusion index analysis is that it
encourages the construction of complete sets of data when incomplete
ones would be more satisfactory. This is so because the timing and
irregular features of the diffusion index change with the number of
series included in it.
Let us suppose that we want to forecast industrial production by
means of average weekly hours worked; the series rationalized in
Sec. 12.13 is a possible leading indicator. If hours worked come broken
down by industry, we suspect we might do better if we use a diffusion
index of the basic series rather than the overall average.
Now our first impulse is to look at the published series for Hours
Worked and make sure that they give complete coverage by industry
and by locality and that the series have no gaps in time. After all,
we want to forecast for all industry and for the entire country. Yet it
is unreasonable to desire full coverage.
First, some industries employ labor as a fixed, not a variable, input.
A generating station, if it is operated at all, is tended by a switchman
24 hours a day, regardless of its output. Labor is uncorrelated with
output. Here is a case where coverage does harm to our forecast,
because it introduces two uncorrelated variables on each side of the
scatter diagram, so to speak.
Second, in the service industries, the physical measure of output is
labor input, because this is how the compilers of government statistics
measure the production of services. If we insist on coverage of the
services, we get trivial correlations, not good forecasts.
Third, during retooling it is possible to have long working hours and
no industrial production. What should we do? Throw out entirely
any industry that has retooling periods? Not at all. It is enough to
suppress temporarily from consideration the data for this industry
until the experts tell us that retooling and catching up on backlog are
over. A deliberate time gap in the statistics improves them. This
method, though it appears to be wasting information, actually uses
more, for it includes the fact that there has been a retooling period.
The fact that the diffusion index is dehomogenized is a flaw of a second
order of importance.
In this way we select statistics of average hours worked to use in
forecasting industrial production which are a statistician's nightmare:
they have time gaps, they are unrepresentative, and they do not
reconcile with national accounts Labor Income when they are multiplied
by an average of wage rates.
Similarly, for statistics that are most useful in forecasting, it is
not necessary that they be classifiable into grand schemes, such as
the National Income, Moneyflows, or InputOutput Tables. The
Canadians plan to start compiling data on Lines of Credit agreed
upon by chartered banks and their customers but not yet credited to
the customer's account. Such a series, I think, will prove a better
predictor than the present one, Business Loans. Now, if we had
information on Lines of Credit, it would not fit any existing global
scheme and would not become any more useful if it did. For forecast
ing purposes, I see no excuse for creating a matrix of Intersector
Contingent Liabilities or for constructing a Balance of Withdrawable
Promises account.
12.17. Disagreements between cross-section and time
series estimates
It is very puzzling to find that careful studies of the consumption
function derived from time series give a significantly larger value for
the marginal propensity to consume than equally competent studies of
crosssection data. Three kinds of explanations are available: (1)
algebraic and (2) statistical properties of the model explain the
discrepancy; and (3) cross-section data and time series data measure
different kinds of behavior. We shall concentrate on explanations
1 and 2 in order to show that algebra and statistics alone account for
much of the difference and that to this extent explanations of the third
category are redundant.
Cross-section data are figures of income, consumption, etc., by
individual families in a given fixed time period. Time series are data
about a given family's consumption and income through time or about
national consumption and income through time.
Algebraic differences
The shape of the consumption function can breed differences. If the
family consumption function is nonlinear, say,
    c = α + βy + γy² + u                                    (12-18)
then the consumption function connecting average income av y and
average consumption av c or total income Y and total consumption C
will look different from equation (12-18), even if all families have the
same consumption function and if the distribution of income remains
constant. To see this, take just two families,
    c₁ = α + βy₁ + γy₁² + u₁
    c₂ = α + βy₂ + γy₂² + u₂
add together and divide by 2 to get
    av c = α + β av y + 2γ(av y)² − γy₁y₂ + av u            (12-19)
and, in general, with N individuals,
    av c = α + β av y + Nγ(av y)² − (2γ/N) Σ_{i<j} y_iy_j + av u    (12-20)
One might argue that, when income distribution remains unchanged,
the cross term Σ y_iy_j remains constant and is absorbed into the estimate
of α. But this is false, because the cross term appears in (12-20)
multiplied by γ, another unknown parameter, whose estimate is bound
up with the estimates of α and β in the least squares (or other) formulas.
The discrepancy between (12-20) and (12-18) affects the estimates of
all three parameters α, β, and γ if no allowance is made for the extra
terms of the average consumption function. The last two terms of
(12-20), apart from av u, are equal to
    Nγ(av y)² − (2γ/N) Σ_{i<j} y_iy_j = γ(1/N) Σ_{i=1}^{N} y_i²
that is to say, γ times the raw moment m′_yy of the family incomes. It follows
that, for time series and cross-section studies to give agreeing results,
the average (or the total) consumption function must contain a term
m′_yy expressing inequality of income, even if this inequality should
remain unchanged from year to year. Moreover, neither the sample
variance of av y nor the Pareto index is suitable for the correction in
question.
If income distribution varies with time, to get the two approaches
to agree our correction must be more elaborate, because the factor
    Σ_{i<j} y_iy_j
must be calculated anew for each year of the data. If we have no
complete census of all families, a sample estimate of Σ y_iy_j will be better
than nothing. If a census of families exists but we are in a hurry,
again we can approximate Σ y_iy_j to any desired degree by taking the
families in large income strata.
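The algebra above can be checked numerically. The sketch below (modern Python with NumPy, offered only as an illustration; all numbers are invented) gives every family the same quadratic consumption function and a fixed income distribution, and shows that average consumption agrees with the family function only when the raw moment m′_yy is carried along:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 2.0, 0.6, 0.01   # one quadratic function shared by all families

N, T = 100, 40                        # N families observed for T years
base = rng.gamma(shape=2.0, scale=10.0, size=N)   # unequal family income levels
growth = 1.0 + 0.02 * np.arange(T)
y = np.outer(growth, base)            # T x N incomes; the distribution's shape stays fixed
u = rng.normal(0.0, 0.5, size=(T, N))
c = alpha + beta * y + gamma * y**2 + u           # equation (12-18), family by family

av_y, av_c = y.mean(axis=1), c.mean(axis=1)

# Plugging average income into the family function omits the inequality term:
pred_naive = alpha + beta * av_y + gamma * av_y**2
# Carrying the raw moment m'_yy = (1/N) * sum of y_i^2 restores the identity (12-20):
pred_full = alpha + beta * av_y + gamma * (y**2).mean(axis=1)

print("max error using (av y)^2 only:", np.abs(av_c - pred_naive).max())
print("max error using m'_yy        :", np.abs(av_c - pred_full).max())
```

The naive error is of the order of γ times the income variance and does not vanish even though the income distribution never changes; only the m′_yy term removes it.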
Statistical differences
Let us assume that the consumption function of a family is linear
and constant over time and that it involves another variable x,
reflecting some circumstance of the family, like age.
    c = α + βy + γx + u                                     (12-21)
However, let the characteristic x, as time passes, have a constant
distribution among the several families. For example, in a stationary
population, the ages of the totality of families remain unchanged,
although the age of any given family always increases. If we aggregate
(12-21) we get
    av c = α + β av y + γ av x + av u                       (12-22)
but av x, being a constant, is absorbed into the constant term when we
estimate (12-22). Not so if we trace the history of one such family by
estimating (12-21) from time series.
In practice there is a further complication: the characteristic x is
not independent of the family's income; thus, β̂ and γ̂ are shaky
estimates in (12-21) because of multicollinearity. This is an additional
reason why time series and cross sections disagree.
Thus, we do not need to go so far afield as to postulate several kinds
of consumption functions (long-term, short-term) to explain these
discrepancies. If, after we have corrected for the algebraic and
statistical sources of discrepancy, some further disagreement remains
unexplained, that is the time for additional theories.
Further readings
Kendall, vol. 2, devotes two lucid chapters to the algebra and statistics
of univariate time series.
The proof that ignoring the serial correlation of the random term in a single
equation leaves least squares estimates unbiased and consistent can be found
in F. N. David and J. Neyman, "Extension of the Markoff Theorem on Least
Squares" (Statistical Research Memoirs, vol. 2, pp. 105-116, December, 1938).
How to treat serial correlation is discussed by D. Cochrane and G. H.
Orcutt, "Application of Least Squares Regression to Relationships Containing
Auto-correlated Error Terms" (Journal of the American Statistical Association,
vol. 44, no. 245, pp. 32-61, March, 1949).
Eugen Slutsky, "The Summation of Random Causes as the Source of
Cyclical Processes" (Econometrica, vol. 5, no. 2, pp. 105-146, April, 1937),
is rightly famous for its contribution to theory and its interesting experimental
examples with random series drawn from a Soviet government lottery.
Correlogram and periodogram shapes are discussed in Kendall, vol. 2,
chap. 30.
The brief discussion of autocorrelation, with examples, in Beach, pp. 176-180,
is simple and useful.
The early article by Edwin B. Wilson, "The Periodogram of American
Business Activity" (Quarterly Journal of Economics, vol. 48, no. 3, pp. 375-417,
May, 1934), is both ambitious and sophisticated.
Tjalling C. Koopmans, in his review, entitled "Measurement without
Theory, " of Arthur F. Burns and Wesley C. Mitchell's Measuring Business
Cycles (Review of Economic Statistics, vol. 29, no. 3, pp. 161-172, August, 1947),
delivers a classic and definitive criticism of some investigators' avoidance of
explicit assumptions. All would-be chartists should read it. Koopmans also
gives, on p. 163, a summary account of the National Bureau method for
isolating cycles.
J. Wise, in "Regression Analysis of Relationships between Autocorrelated
Time Series" (Journal of the Royal Statistical Society, ser. B, vol. 18, no. 2,
pp. 240-256, 1956), shows that, in recursive systems of two or more equations,
least squares is biased both when the random terms of the separate equations
are interdependent and when the random term of either equation is serially
correlated.
The reference of Sec. 12.12 is G. H. Orcutt, "A Study of the Autoregressive
Nature of the Time Series Used for Tinbergen's Model of the Economic
System of the United States 1919-1932," with discussion (Journal of the
Royal Statistical Society, ser. B, vol. 10, no. 1, pp. 1-53, 1948). Arthur J.
Gartaganis, "Autoregression in the United States Economy, 1870-1929"
(Econometrica, vol. 22, no. 2, pp. 228-243, April, 1954), uses much longer time
series and concludes that the overall autoregressive structure changed
drastically around the year 1913. Gartaganis uses six lags.
I have discussed the mathematical properties of the diffusion index in
"Must the Diffusion Index Lead?" (American Statistician, vol. 11, no. 4,
pp. 12-17, October, 1957). Geoffrey Moore's comments are on pp. 16-17.
Trygve Haavelmo, "Family Expenditures and the Marginal Propensity
to Consume" (Econometrica, vol. 15, no. 4, pp. 335-341, October, 1947),
reprinted as Cowles Commission Paper 26, affords a good exercise in the
decoding of compact econometric argument. Haavelmo deals with the
discrepancies arising from different ways of measuring the consumption
function.
APPENDIX A
Layout of computations
I recommend a standard layout, no matter how large or small the
model or what estimating procedure one plans to use (least squares,
maximum likelihood, limited information) or what simplifying assump
tions one has made. There are three general rules to follow:
1. Scale to avoid large rounding errors and to detect other errors more
easily. Scaling should be applied in two stages.
a. Scale the variables.
b. Scale the moments.
2. Use check sums.
3. Compute all the basic moments. This may seem redundant, but
is actually very efficient if one wants
a. To compute correlations.
b. To experiment with alternative models.
c. To get least squares first approximations.
d. To select the best instrumental variables.
The rules in detail
Stage 1
Scale the variables. Express all of them in units of measurement
(say, cents, tens of dollars, thousands, billions, etc.) that reduce all the
variables to comparable magnitudes. Scale the units so as to bring
the variables (or most of them) into the range from 0 to 1. For
instance:
    National income      x₁ = 0.475 trillion dollars
    Hourly wage rate     x₂ = 0.182 tens of dollars
    Population           x₃ = 0.165 billions
    Price of platinum    x₄ = 0.945 hundreds of dollars per ounce
This, rather than the range 1 to 10 or 10 to 100, is preferred, because
we shall include an auxiliary variable identically equal to 1. Then all
variables, regular and auxiliary, are of the same order of magnitude.
Stage 2
Arrange the raw observations as in Table A.1. Note that the
endogenous variables, the y's, are followed by their check sum Y and
that, in addition to all the exogenous variables z₁, z₂, . . . , z_{H−1}, we
devote a column to the constant number 1, which is defined as the last
exogenous variable z_H. These are then followed by the check sum Z
of the exogenous variables including z_H = 1 and by a grand sum
X = Y + Z.
Stage 3
The raw moment of variable p on variable q is defined as
    m′_pq = Σ pq
where the sum is over the sample. A raw moment is not the same
thing as the simple moment m_pq defined in the Digression of Sec. 1.2.
The simple moment m_pq is also called the (augmented) moment from the
mean of variable p on variable q.
    Compute the raw moments of all variables on all variables. This
gives the symmetrical matrix m′ of moments, shown in Table A.2.
In Table A.2 the symbol m′ is omitted, and only the subscripts appear;
for instance, y_Gy₁ stands for m′_{y_Gy₁}.
Table A.1
Arrangement of raw observations

        Endogenous             Check    Exogenous variables           Check   Grand
 Time   variables              sum      Regular              Aux.     sum     sum
        y₁    · · ·  y_G        Y       z₁ · · · z_{H−1}      1        Z       X
 -----------------------------------------------------------------------------------
  1     y₁(1) · · ·  y_G(1)    Y(1)     z₁(1) · · · z_{H−1}(1)  1     Z(1)    X(1)
  2     y₁(2) · · ·  y_G(2)    Y(2)     z₁(2) · · · z_{H−1}(2)  1     Z(2)    X(2)
  3     y₁(3) · · ·  y_G(3)    Y(3)     z₁(3) · · · z_{H−1}(3)  1     Z(3)    X(3)
  ·       ·                     ·         ·                     ·      ·       ·
  S     y₁(S) · · ·  y_G(S)    Y(S)     z₁(S) · · · z_{H−1}(S)  1     Z(S)    X(S)
Table A.2

            y₁  · · ·  y_G      Y      z₁    z₂   · · ·  z_{H−1}      1      Z      X
   y₁      y₁y₁ · · · y₁y_G    y₁Y    y₁z₁  y₁z₂  · · ·  y₁z_{H−1}   y₁·1   y₁Z    y₁X
   ·        ·                   ·      ·                              ·      ·      ·
   y_G     y_Gy₁ · · · y_Gy_G  y_GY   y_Gz₁ y_Gz₂ · · ·  y_Gz_{H−1}  y_G·1  y_GZ   y_GX
   Y       Yy₁  · · ·  Yy_G    YY     Yz₁   Yz₂   · · ·  Yz_{H−1}    Y·1    YZ     YX
   z₁      z₁y₁ · · ·  z₁y_G   z₁Y    z₁z₁  z₁z₂  · · ·  z₁z_{H−1}   z₁·1   z₁Z    z₁X
   z₂      z₂y₁ · · ·  z₂y_G   z₂Y    z₂z₁  z₂z₂  · · ·  z₂z_{H−1}   z₂·1   z₂Z    z₂X
   ·        ·                   ·      ·                              ·      ·      ·
   z_{H−1} z_{H−1}y₁ · · ·     z_{H−1}Y  z_{H−1}z₁ · · ·  z_{H−1}z_{H−1}  z_{H−1}·1  z_{H−1}Z  z_{H−1}X
→  1       1·y₁ · · ·  1·y_G   1·Y    1·z₁  1·z₂  · · ·  1·z_{H−1}   1·1    1·Z    1·X
   Z       Zy₁  · · ·  Zy_G    ZY     Zz₁   Zz₂   · · ·  Zz_{H−1}    Z·1    ZZ     ZX
   X       Xy₁  · · ·  Xy_G    XY     Xz₁   Xz₂   · · ·  Xz_{H−1}    X·1    XZ     XX

(The arrow marks the row, and by symmetry the column, of the auxiliary variable 1.)
Stage 4
Compute the augmented moments from the mean of each variable
(except z_H = 1) on each variable, e.g.,
    m_{x₁x₂} = S m′_{x₁x₂} − m′_{x₁·1} m′_{x₂·1}
This is done very easily because m′_{x₁·1} and m′_{x₂·1} are always on the level
indicated by the arrows and in the row and column corresponding to
x₁ and x₂.
    This procedure gives a square symmetric matrix m of moments
from the mean. The new matrix contains one row and one column
less than the matrix m′.
Stage 5
Rule for check sums. In both m and m' any entry containing a
capital Y (or Z) is equal to the sum of all entries in its row that contain
lowercase y's (or z's). Any entry containing a capital X is equal to
the sum of everything that precedes it in the row.
All these things are true in the vertical direction, since the matrices
m' and m are symmetric.
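Stages 2 to 5 (and the correlation coefficients of Stage 7) can be sketched in a few lines of modern code. The following Python/NumPy rendering is purely illustrative (the dimensions and data are invented; the book of course assumes desk calculation):

```python
import numpy as np

rng = np.random.default_rng(1)
S, G, H = 12, 2, 3                      # sample size; G endogenous y's; H exogenous z's, z_H = 1

y = rng.normal(size=(S, G))
z = np.column_stack([rng.normal(size=(S, H - 1)), np.ones(S)])  # auxiliary variable 1 last
Y, Z = y.sum(axis=1), z.sum(axis=1)     # check sums of Table A.1
X = Y + Z                               # grand sum

data = np.column_stack([y, Y, z, Z, X]) # column order of Tables A.1 and A.2
m_raw = data.T @ data                   # raw moments: m'_pq = sum over the sample of p*q

# Stage 5 check-sum rule, verified on the Y, Z, and X columns:
col_Y, col_1, col_Z, col_X = G, G + H, G + 1 + H, G + 2 + H
assert np.allclose(m_raw[:, col_Y], m_raw[:, :G].sum(axis=1))
assert np.allclose(m_raw[:, col_Z], m_raw[:, G + 1:G + 1 + H].sum(axis=1))
assert np.allclose(m_raw[:, col_X], m_raw[:, col_Y] + m_raw[:, col_Z])

# Stage 4: augmented moments from the mean, m_pq = S*m'_pq - m'_{p.1}*m'_{q.1};
# the row of the constant 1 supplies all the sums m'_{p.1}.
ones_row = m_raw[col_1]
m = S * m_raw - np.outer(ones_row, ones_row)

# Stage 7: the sample correlation of y_1 and z_1 read off the moment matrix
r = m[0, col_Y + 1] / np.sqrt(m[0, 0] * m[col_Y + 1, col_Y + 1])
assert np.isclose(r, np.corrcoef(y[:, 0], z[:, 0])[0, 1])
print("check sums verified; r(y1, z1) =", round(r, 4))
```

The asserts fail loudly if an arithmetic slip breaks a check sum, which is exactly the service the check sums performed on a desk calculator.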
Stage 6
Scale the moments. This step is not always possible. Scan the
symmetric matrix m. If it contains any row (hence, column) of
entries all (or nearly all) of which are very large or very small relative
to the rest of the rows and columns, divide or multiply the entire
offending row and column by an appropriate power of 10. The purpose
is to make the matrix m contain entries as nearly equal as possible.
When moments have comparable magnitudes, matrix operations on
them are very accurate, rounding errors are small, and calculating
errors can be readily detected.
Keep accurate track of the variables that have been scaled up or
down in stages 1 and 6 and of how many powers of 10 in each stage
and altogether.
Stage 7
Coefficients of correlation. These can be computed very easily
from m, but unfortunately the checks do not work in this case. So
drop the check sums and consider only part of m. The sample
correlation coefficient between, say, the variables y_G and z_h is
    r = m_{y_Gz_h} / √(m_{y_Gy_G} m_{z_hz_h})
Coefficients of correlation are used informally to screen out the most
promising models (see Chap. 10 on bunch maps).
Matrix inversion
This is a frequent operation in estimating medium and large systems.
Details for computing
    M⁻¹ and M⁻¹N
are given in Klein, pp. 151ff. There are various clever devices for
inverting a matrix and performing the operation M⁻¹N. Electronic
computers have standard programs, and it is well to use them if they
are available. If M is small in size and if both M⁻¹ and M⁻¹N are
wanted, do the following: Write side by side the matrix M, the unit
matrix of the same size, and then N.
[M][I][N]
Then perform linear combinations on the rows of the entire new
matrix [MIN] in such a way as to reduce M to a unit matrix. When
you have finished, you will have obtained
    [I][M⁻¹][M⁻¹N]
For example, let

    M = [ 2  4 ]        N = [ 30  50 ]
        [ 1  6 ]            [  1   3 ]

We shall trace the evolution of [M][I][N] into [I][M⁻¹][M⁻¹N].

    (MIN)₀ = [ 2  4 | 1  0 | 30  50 ]
             [ 1  6 | 0  1 |  1   3 ]

Divide the first row by 2.

    (MIN)₁ = [ 1  2 | ½  0 | 15  25 ]
             [ 1  6 | 0  1 |  1   3 ]

In this new matrix, subtract row 1 from row 2.

    (MIN)₂ = [ 1  2 |  ½  0 |  15   25 ]
             [ 0  4 | −½  1 | −14  −22 ]

Divide the new second row by 2.

    (MIN)₃ = [ 1  2 |  ½  0 |  15   25 ]
             [ 0  2 | −¼  ½ |  −7  −11 ]

Subtract the new second row from the first.

    (MIN)₄ = [ 1  0 |  ¾  −½ |  22   36 ]
             [ 0  2 | −¼   ½ |  −7  −11 ]

Divide the new second row by 2.

    (MIN)₅ = [ 1  0 |  ¾   −½ |   22     36  ]
             [ 0  1 | −⅛    ¼ |  −7/2  −11/2 ]
                       [M⁻¹]        [M⁻¹N]
One can compute a string of "quotients" M⁻¹N, M⁻¹P, M⁻¹Q, etc.,
by tacking on N, P, Q and performing the linear manipulations. This
technique works in principle for matrices of any size, but with more than
three or four rows it consumes a lot of paper and time.
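The same row operations can be checked mechanically. The sketch below (Python with NumPy, offered only as a verification aid for the worked example above) applies the five operations to the tableau [M | I | N]:

```python
import numpy as np

M = np.array([[2.0, 4.0],
              [1.0, 6.0]])
N = np.array([[30.0, 50.0],
              [ 1.0,  3.0]])

# Side-by-side tableau [M | I | N], reduced until the left block becomes I
T = np.hstack([M, np.eye(2), N])
T[0] /= 2.0          # divide the first row by 2
T[1] -= T[0]         # subtract row 1 from row 2
T[1] /= 2.0          # divide the new second row by 2
T[0] -= T[1]         # subtract the new second row from the first
T[1] /= 2.0          # divide the new second row by 2

M_inv, M_inv_N = T[:, 2:4], T[:, 4:6]
print(M_inv)         # [[ 0.75  -0.5 ]  [-0.125  0.25]]
print(M_inv_N)       # [[22.  36. ]  [-3.5 -5.5]]
```

Multiplying back, M·M⁻¹ = I and M·(M⁻¹N) = N, which is the natural hand check on the arithmetic.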
APPENDIX B
Stepwise least squares
Estimating the parameters of
    y = γ₀ + γ₁z₁ + γ₂z₂ + γ₃z₃ + · · · + γ_Hz_H + u
by desk calculator, according to Cramer's rule, or by matrix inversion
is a formidable task when H is greater than 3. The stepwise procedure
about to be explained may be slow, but it has three advantages over
the other methods:
1. It can be stopped validly at any stage.
2. It possesses excellent control over rounding and computational
errors.
3. We do not have to commit ourselves, ahead of time and once and
for all, on how many decimal places to carry in the course of computa
tions, but we may rather carry progressively more as the estimates are
successively refined.
I shall illustrate the method by the simple case
    w = αx + βy + γz + u
where (in violation of the usual conventions) w is endogenous, and
x, y, z are exogenous. For the sake of illustration let us assume that,
in the sample we happen to have drawn, the exogenous variables are
"slightly" intercorrelated, so that m_xy, m_xz, m_xu, m_yz, m_yu, m_zu are
small numbers, although, of course, m_xx = m_yy = m_zz = 1.
Step 1
On the basis of a priori information, arrange the exogenous variables
from the most significant to the least significant, that is to say, accord
ing to the imagined size of the parameters α, β, γ, disregarding their
signs:
    |α| > |β| > |γ|
Step 2
To estimate a first approximation to α, compute α̂₁ = m_wx/m_xx as an
approximation to the true value. Let α̂₁ = α + A₁.
Step 3
    Form a new variable
    v = w − α̂₁x
and estimate a first approximation to β by computing
    β̂₁ = m_vy / m_yy
Step 4
    Form the new variable
    s = v − β̂₁y
and then compute the first approximation to γ:
    γ̂₁ = m_sz / m_zz
Step 5
    Form the new variable
    w₁ = s − γ̂₁z
and compute
    Â₁ = −m_{w₁x} / m_xx
The idea here is to estimate the error A₁ of our first approximation
α̂₁. Compute
    α̂₂ = α̂₁ − Â₁
as a second approximation to α.
Step 6
Now that a better estimate of α is available, there is no point in
correcting the first approximations β̂₁, γ̂₁. We discard them and
attempt to get new approximations β̂₂, γ̂₂, based on the better estimate
α̂₂. We first use α̂₂ to define a new variable
    v₁ = w − α̂₂x
Note that in this step we adjust the original variable w (not w₁).
Proceed now as in steps 3 to 5:
    β̂₂ = m_{v₁y} / m_yy
    s₁ = v₁ − β̂₂y
    γ̂₂ = m_{s₁z} / m_zz
    w₂ = s₁ − γ̂₂z
    Â₂ = −m_{w₂x} / m_xx
    α̂₃ = α̂₂ − Â₂
and so on.
The method of stepwise least squares does indeed yield better
estimates in each round. To see this, consider the steps
    α̂₁ = m_wx/m_xx = m_{(αx+βy+γz+u)x}/m_xx = α + (βm_yx + γm_zx + m_ux)/m_xx
       = α + A₁
    v = w − α̂₁x = αx + βy + γz + u − αx − A₁x = βy + γz + u − A₁x
    β̂₁ = m_vy/m_yy = β + m_{(γz+u−A₁x)y}/m_yy = β + B₁
    s = v − β̂₁y = βy + γz + u − A₁x − (β + B₁)y = γz + u − A₁x − B₁y
    γ̂₁ = m_sz/m_zz = γ + m_{(u−A₁x−B₁y)z}/m_zz = γ + C₁
    w₁ = s − γ̂₁z = γz + u − A₁x − B₁y − (γ + C₁)z = u − A₁x − B₁y − C₁z
    Â₁ = −m_{w₁x}/m_xx = A₁ − (m_ux − B₁m_yx − C₁m_zx)/m_xx
    α̂₂ = α̂₁ − Â₁ = α + A₁ − Â₁ = α + (m_ux − B₁m_yx − C₁m_zx)/m_xx
       = α + A₂
The residual factor A₂ is of smaller order of magnitude than A₁,
and so α̂₂ is better than α̂₁ as an estimate of α. To see this consider
    A₂ = m_ux/m_xx − B₁m_yx/m_xx − C₁m_zx/m_xx
Expressing B₁ and C₁ in terms of A₁,
    A₂ = A₁[ m_xym_yx/(m_xxm_yy) + m_xzm_zx/(m_xxm_zz) − m_xym_yzm_zx/(m_xxm_yym_zz) ]
       + γ[ m_zym_yzm_zx/(m_xxm_yym_zz) − m_zym_yx/(m_xxm_yy) ]
       + [ m_ux/m_xx − m_yxm_uy/(m_xxm_yy) − m_zxm_uz/(m_xxm_zz) + m_uym_yzm_zx/(m_xxm_yym_zz) ]
Each bracketed term is of small order of magnitude. So, unless γ is
numerically very large (which was guarded against in step 1), it
follows that α̂₂ is an improvement over α̂₁. The same can be shown
for β̂₂, γ̂₂, and α̂₃, compared with β̂₁, γ̂₁, and α̂₂, respectively.
The method of stepwise least squares can also be used when x, y, and
z are endogenous variables. In this case, although the bracketed
terms are not negligible, they keep decreasing in successive rounds of
the procedure. Had another variable been treated as the independent
one, the stepwise method (like any procedure based on naive least
squares) would, in general, have given another set of results.
APPENDIX C
Subsample variances
as estimators
Consider the model y = α + γz + u, under all the Simplifying
Assumptions, with the variables measured from their means. Let
    γ̂ = (z₁y₁ + · · · + z_Sy_S)/(z₁² + · · · + z_S²)
be the maximum likelihood (and also the least squares) estimate of γ
based on a sample of size S. Its variance is
    var γ̂ = E(γ̂ − Eγ̂)² = E(γ̂ − γ)² = E(Δγ̂)²
          = E[(u₁z₁ + · · · + u_Sz_S)²/(z₁² + · · · + z_S²)²]
Holding z₁, . . . , z_S fixed, this reduces to
    σ²(z₁² + · · · + z_S²)/(z₁² + · · · + z_S²)² = σ²/(z₁² + · · · + z_S²) = σ²/m_zz
Let us now ask what happens on the average if from the original
sample we obtain its S subsamples (each of size S − 1), if we compute
the corresponding parameters
    γ̂(1), γ̂(2), . . . , γ̂(S)
and if we then compute the sample variance V of these γ̂'s. Writing
Δ_s = γ̂(s) − γ,
    SV = Σ_s [γ̂(s) − av γ̂(·)]² = Σ_s (Δ_s)² − (1/S)(Σ_s Δ_s)²
    E(SV) = S·E(V) = E Σ_s (Δ_s)² − (1/S) E(Σ_s Δ_s)²
For our fixed constellation of values (z₁, . . . , z_S) of the exogenous
variable, any terms of the form E(u_iu_j) (i ≠ j) equal zero. By
careful manipulation we obtain
    S·E(V) = ((S − 1)/S) σ² Σ_s 1/(m_zz − z_s²)
             − (2σ²/S) Σ_{s<t} (m_zz − z_s² − z_t²)/[(m_zz − z_s²)(m_zz − z_t²)]
So far this is an exact result. If we make the further assumption that
the values z₁, . . . , z_S of the exogenous variable are spread "typically,"
the relations
    m_zz − z_s² = ((S − 1)/S) m_zz
    m_zz − z_s² − z_t² = ((S − 2)/S) m_zz
are approximately true, or at least become so in the course of the
summations given in the last brackets. Therefore,
    E(V) = σ²/[(S − 1)m_zz]
It follows that (S − 1)V is an unbiased estimator of cov (γ̂,γ̂) = var γ̂.
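A small Monte Carlo experiment bears the result out. The sketch below (Python with NumPy, illustrative only; α is taken as zero, and the z's are given equal magnitudes so that the "typical spread" relations hold exactly) compares (S − 1)V with the true variance σ²/m_zz:

```python
import numpy as np

rng = np.random.default_rng(3)
S, sigma, gamma = 10, 1.0, 0.7
reps = 20000

# Exogenous values spread "typically": equal magnitudes make
# m_zz - z_s^2 = ((S-1)/S) m_zz hold exactly
z = np.tile([1.0, -1.0], S // 2)
m_zz = (z**2).sum()

u = rng.normal(0.0, sigma, size=(reps, S))
yv = gamma * z + u                       # the model with alpha = 0
num = yv @ z
ghat = num / m_zz                        # full-sample estimates of gamma

# The S leave-one-out subsample estimates and their sample variance V
sub = (num[:, None] - yv * z) / (m_zz - z**2)
V = sub.var(axis=1)                      # V = (1/S) * sum of squared deviations

print("true var gamma-hat = sigma^2/m_zz :", sigma**2 / m_zz)
print("Monte Carlo variance of gamma-hat :", round(ghat.var(), 4))
print("average of (S-1)*V                :", round(((S - 1) * V).mean(), 4))
```

The two computed numbers agree with σ²/m_zz up to sampling noise, which is the unbiasedness claim of this appendix.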
APPENDIX D
Proof of least squares bias
in models of decay
Let the variable δ_t be equal to 1 if time period t is in the sample, and
zero otherwise. The least squares estimate of γ is
    γ̂ = Σ_t δ_t y_t y_{t−1} / Σ_t δ_t y²_{t−1}
The proposition E(γ̂) < γ will be proved by mathematical induction.
It will be shown true for an arbitrary sample of S = 2 points; then its
truth for S + 1 will be shown to follow from its truth for any S.
Definition. A conjugate set of samples contains all samples having
the following properties :
1. The samples include the same time periods and skip the same
time periods (if any); let t₁ be the first period included and t_S the last.
2. If time period t is included in the samples, then all samples of the
conjugate set have disturbances u_t of the same absolute value. The
disturbances do not have to be constant from period to period.
3. When a time period is skipped by the sample, algebraically equal
disturbances must have operated on the model during the skipped
periods.
4. The samples have come from a universe having, as of t₁, the same
values for all predetermined variables.
Consider an arbitrary sample of two points. Let one point come
from period j and the other from period k (k > j) ; the sample can be
completely described by the disturbances operating at and between
these two time periods; that is,
    S₁ = (+u_j, u_{j+1}, . . . , u_{k−1}, +u_k)
S₁ has three conjugates:
    S₂ = (+u_j, u_{j+1}, . . . , u_{k−1}, −u_k)
    S₃ = (−u_j, u_{j+1}, . . . , u_{k−1}, +u_k)
    S₄ = (−u_j, u_{j+1}, . . . , u_{k−1}, −u_k)
Denote the four corresponding least squares estimates of γ by the
symbols γ̂(++), γ̂(+−), γ̂(−+), and γ̂(−−).
By definition, each of the four conjugate samples has inherited from
the past the same value y_{j−1} of lagged consumption. In period j the
random disturbance u_j operates positively for samples S₁ and S₂ and
negatively for S₃ and S₄. Therefore, in the next period S₁ and S₂
inherit one value p = γy_{j−1} + u_j for lagged consumption, and samples
S₃ and S₄ inherit another value n = γy_{j−1} − u_j. By the definition of
conjugates, in periods j + 1, j + 2, . . . , k − 1, equal random
disturbances affect all samples S₁ to S₄. Moreover, in period k,
samples S₁ and S₂ receive an equal inheritance of lagged consumption
from the past. Call it y_p. Its exact value can be obtained by applying
model (3-2) (see Sec. 3.2) to p successively enough times, but this value
is of no interest. Samples S₃ and S₄ each get the inheritance y_n, which
likewise arises from the application of model (3-2) to n. The two
inheritances are different: y_p > y_n, since p > n.
When we come to period k, samples S₁ and S₂ part company, because
the first receives a boost +u_k, and the second receives the opposite
−u_k. For the same reason S₃ parts company with S₄.
    Define
    q = γy_p + u_k    r = γy_p − u_k    v = γy_n + u_k    w = γy_n − u_k
The four conjugate estimates are
    γ̂(++) = (py_{j−1} + qy_p)/(y²_{j−1} + y_p²)    γ̂(+−) = (py_{j−1} + ry_p)/(y²_{j−1} + y_p²)
    γ̂(−+) = (ny_{j−1} + vy_n)/(y²_{j−1} + y_n²)    γ̂(−−) = (ny_{j−1} + wy_n)/(y²_{j−1} + y_n²)
Symbolize the sum of these four estimates by Σ_{conj S₁} γ̂(±±) or Σ γ̂.
    Σ γ̂(±±) = [2py_{j−1} + (q + r)y_p]/(y²_{j−1} + y_p²)
             + [2ny_{j−1} + (v + w)y_n]/(y²_{j−1} + y_n²)
             = 4γ + 2u_jy_{j−1}[1/(y²_{j−1} + y_p²) − 1/(y²_{j−1} + y_n²)]
             = 4γ + residual
The residual is always negative, because y_p² > y_n² if y_{j−1} > 0, and
y_p² < y_n² if y_{j−1} < 0. Therefore the average γ̂ estimate from this
conjugate set of samples is less than the true γ.
Consider an arbitrary sample of size S + 1, which I call sample
B(+). Let it contain observations from time periods j₁, j₂, . . . , j_S,
j_{S+1} (which need not be consecutive). B(+) can be completely
described by the disturbances that generated it plus the predetermined
condition y_{j₁−1}.
    B(+) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, +u_{j_{S+1}})
Now consider another sample A which contains one time period
(the last one) less than B(+) but which is in all other respects the
same as B(+):
    A = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S})
The conjugate set of A can be described briefly by
    conj A = (y_{j₁−1}; ±u_{j₁}, ±u_{j₂}, . . . , ±u_{j_S})
The conjugate set of B(+) has twice as many samples as conj A; the
elements of conj B(+) can be constructed from elements of conj A by
adding another period in which the disturbance u_{j_{S+1}} shows up once
with a plus sign and once with a minus sign.
Define B(−) as the sample consistent with predetermined condition
y_{j₁−1} and containing all the disturbances of sample B(+) identically,
except the last, which takes the opposite sign. Therefore, if
    B(+) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, +u_{j_{S+1}})
then
    B(−) = (y_{j₁−1}; u_{j₁}, u_{j₂}, . . . , u_{j_S}, −u_{j_{S+1}})
Assume that the estimates γ̂ derived from conj A average less than
the true γ (0 < γ < 1). Symbolize this statement as follows:
    Σ_{conj A} γ̂(± ± · · · ±) < 2^S γ
Each γ̂ in the above sum is a fraction of the form
    γ̂ = N/D                                                (1)
Let γ̂(A) stand for the estimate derived from sample A. Then γ̂(A)
can be expressed as a quotient N/D of two specific sums N and D,
where D is positive. Each sample from the set conj B gives rise to an
estimate of γ. The formulas now are fractions like (1), but the sums
have one more term in the numerator and one more in the denominator,
because one more period is involved. Thus, writing y′ for y_{j_{S+1}−1},
    γ̂[B(+)] = [N + y′(γy′ + u_{j_{S+1}})]/(D + y′²) = (N + γy′² + y′u_{j_{S+1}})/(D + y′²)
Similarly,
    γ̂[B(−)] = (N + γy′² − y′u_{j_{S+1}})/(D + y′²)
It follows that
    γ̂[B(+)] + γ̂[B(−)] = 2(N + γy′²)/(D + y′²)
If γ̂(A) > γ, then the last fraction is less than γ̂(A); if γ̂(A) = γ,
then the fraction equals γ; if γ̂(A) < γ, then the fraction is less than γ.
Exactly the same is true for all samples in the conjugate set of A.
Therefore,
    Σ_{conj B} γ̂ < 2 Σ_{conj A} γ̂ < 2^{S+1} γ
which completes the proof.
I shall not discuss what happens if the value of γ does not lie between
0 and 1, because no new principle or new difficulty arises. As an
exercise, find the bias of γ̂ if γ lies between 0 and −1.
APPENDIX E
Completeness and stochastic
independence
The proof that E(u_gu_p) = 0 implies det B ≠ 0 is by contradiction.
If det B = 0, then a nontrivial linear combination of B's rows is 0 (the
zero vector):
    L = λ₁β₁ + · · · + λ_Gβ_G = 0                           (1)
Hence,
    λ₁β₁y + · · · + λ_Gβ_Gy = 0
But, by the model By + Γz = u, we also have β_gy = u_g − γ_gz.
So
    λ₁u₁ + · · · + λ_Gu_G = (λ₁γ₁ + · · · + λ_Gγ_G)z = Z    (2)
Since Z is a constant number for any constellation of values of the
exogenous variables, we have in equation (2) a nontrivial linear relation
among the disturbances u₁, . . . , u_G. This contradicts the premise
that they are independent.
    This argument shows that E(u_gu_p) = 0 if and only if det B ≠ 0.
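The contradiction can be seen numerically: with a singular B, the same λ that annihilates B's rows turns λ′u into a nonrandom function of z alone. A minimal sketch (Python with NumPy; the matrices are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
# det B = 0: the second row of B is twice the first, so lambda = (2, -1, 0)
# is a nontrivial combination with lambda'B = 0
B = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 0.0],
              [0.0, 1.0, 1.0]])
Gamma = np.array([1.0, 0.5, 2.0])
lam = np.array([2.0, -1.0, 0.0])
assert abs(np.linalg.det(B)) < 1e-12 and np.allclose(lam @ B, 0.0)

z = 3.0                                  # one constellation of the exogenous variable
ys = rng.normal(size=(1000, 3))          # whatever values the y's may take...
u = ys @ B.T + z * Gamma                 # ...the model By + Gamma*z = u fixes the u's

combo = u @ lam                          # lambda'u = (lambda'Gamma) z, the constant Z
print(round(combo.std(), 9))             # 0.0 up to rounding: an exact relation among the u's
```

The combination λ′u never varies, so the u's cannot be independent random variables, which is the contradiction used above.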
APPENDIX F
The asterisk notation
A single star (*) means presence in a given equation of the variable
starred; a double star (**) means absence.
Accordingly, in the model
    y₁ + γ₁₁z₁ + γ₁₂z₂ + γ₁₃z₃ + γ₁₄z₄ = u₁                (1)
    β₂₁y₁ + y₂ + β₂₃y₃ + γ₂₁z₁ + γ₂₂z₂ + γ₂₃z₃ = u₂        (2)
    β₃₁y₁ + y₃ + γ₃₁z₁ + γ₃₂z₂ = u₃                        (3)
with reference to the third equation,
y* means the vector of the endogenous variables present in the
third equation, namely, vec (y₁,y₃)
    y** = vec (y₂)
    z* = vec (z₁,z₂)
    z** = vec (z₃,z₄)
For the first equation,
    y* = vec (y₁)
    y** = vec (y₂,y₃)
    z* = vec (z₁,z₂,z₃,z₄)
Stars (single or double) may also be placed on the symbol x, which
stands for all variables, endogenous or exogenous. Thus for the
second equation,
    x* = vec (y₁,y₂,y₃,z₁,z₂,z₃)
    x** = vec (z₄)
G* is the number of y's present in the gth equation; G** is the number
absent. H* is the number of z's present; H** is the number absent.
Examples:
    G₁* = 1    H₁* = 4    G₂* = H₂* = 3    G₃** = 1    H₃** = 2
α_g*, β_g*, γ_g* are vectors made up of the nonzero parameters of the
gth equation in their natural order. α here serves as a general symbol,
like x, for all parameters, β or γ. Examples:
    α₁* = vec (1,γ₁₁,γ₁₂,γ₁₃,γ₁₄)
    β₁* = vec (1)
    γ₁* = vec (γ₁₁,γ₁₂,γ₁₃,γ₁₄)
    γ₂* = vec (γ₂₁,γ₂₂,γ₂₃)
    α₃* = vec (β₃₁,1,γ₃₁,γ₃₂)
    γ₃* = vec (γ₃₁,γ₃₂)
In Chap. 8, I place stars (or pairs of stars) not on vectors but on the
variables themselves to emphasize their presence in (or absence from)
an equation. For instance, in discussing the third equation above,
we may write y₁*, y₂**, y₃*, z₁*, z₂*, z₃**, z₄** to stress that y₁, y₃, z₁, z₂ do
appear in the third equation whereas the other variables y₂, z₃, z₄
do not.
Finally, A_g** means the matrix that can be formed from the elements
of A by taking only the columns of A that correspond to the variables
x** absent from the gth equation. For example,
    A₁** = [ 0    0
             1   β₂₃
             0    1  ]
The columns of A₁** correspond to x₁** = vec (y₂,y₃).
    A₃** = [ 0   γ₁₃  γ₁₄
             1   γ₂₃   0
             0    0    0  ]
corresponding to x₃** = vec (y₂,z₃,z₄).
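The notation can be mechanized. The sketch below (Python with NumPy; the numerical values are invented stand-ins for the β's and γ's) builds the full coefficient matrix A = [B | Γ] of the model above and extracts A_g** by selecting the columns of the absent variables:

```python
import numpy as np

# Columns follow x = (y1, y2, y3, z1, z2, z3, z4); rows are equations 1 to 3.
# Nonzero entries stand for 1, the beta's, and the gamma's (values invented).
A = np.array([
    [1.0, 0.0,  0.0, 0.5, -1.0,  0.3, 2.0],   # equation (1)
    [0.4, 1.0, -0.2, 0.7,  0.2, -0.5, 0.0],   # equation (2)
    [0.1, 0.0,  1.0, 1.1, -0.9,  0.0, 0.0],   # equation (3)
])
labels = ["y1", "y2", "y3", "z1", "z2", "z3", "z4"]

def absent(g):
    """Indices of x**: the variables with a zero coefficient in equation g."""
    return [j for j in range(A.shape[1]) if A[g - 1, j] == 0.0]

def A_double_star(g):
    """A_g**: the columns of A for the variables absent from equation g."""
    return A[:, absent(g)]

print("x3** =", [labels[j] for j in absent(3)])   # ['y2', 'z3', 'z4']
print(A_double_star(3))
```

Selecting columns this way reproduces A₁** and A₃** exactly as displayed above.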
Index
Aggregation, 5
Allen, R. G. D., xvii, 51
Assumptions, additivity, error term, 5,
18,22
Simplifying, 917
statistical, 4
stochastic, 4, 6, 95
(See also Error term)
Autocorrelation (see Serial correlation)
Autoregression, 52-53, 61, 168, 195
of the economy, overall, 185-186,
195-196
Bartlett, M. S., 51
Beach, E. F., xvii, 21, 146, 155
Bennion, E. G., 134
Bias, 35-39
instrumental variables, 114
least squares, 38, 47, 68, 72
in models of decay, 52-62, 209-212
secular consumption function, 71
simultaneous interdependence, 65-68
(See also Unbiasedness)
Bogus relations, 89-90
Bronfenbrenner, Jean, 72
Bunch map analysis, 146-150, 165
Burns, A. F., 195
Business cycle model, 171-174, 183-184
Causality, 63-64, 106, 108, 112-113,
119-121
Causation, chain of, 119-121
Central limit theorem, 14
Characteristic vector (eigenvector), 124
Christ, C. F., 21, 134
Cobweb phenomena, 16
Cochrane, D., 195
Completeness, 86, 106, 213
(See also Nonsingularity)
Computations, 32, 197-202
Conflicting observations, 109
Confluence, 103-106, 155
linear, 142-144
Consistency, 24, 36, 43-46, 51, 118
in instrumental variables, 114
in least squares, 44, 47-49, 195
in limited information, 118
and maximum likelihood, 44, 47-49
Consistency, notation for, 44
in reduced form, 100, 103
in Theil's method, 118, 128
Consumption function, cross section
versus time series, 192-196
examples, 2, 63-72, 137-139
secular, bias in, 71
Correcting denominators, 180-181
Correlation, 141-142
partial, 144-145
serial, 168-176, 195
spurious, 150
Correlation coefficient, matrix, 145
partial, 144-145
sample, 141-142, 155
computation of, 200-201
universe, 141-142, 155
Correlograms, 174-175, 195
Cost function, 21-22
Counting rules for identification, 92-96,
102
Courant, R., 84
Covariance, 19, 27, 82-83, 170, 207-208
error term, 30, 49
population and sample, 141
Cramer's rule, 35, 203
Credence, 2^, 4i4L», 110
Cross-section versus time-series estimates, 192-196
(See also Consumption function)
Cyclical form, 121
(See also Recursive models)
Cyclical indicators, 182
(See also Business cycle model)
David, F. N., 195
Dean, J., 22
Decay, models of, initial conditions in,
53, 58-61
least squares bias in, 52-62, 171,
209-212
unbiased estimation in, 60-61
Degrees of freedom, 3
Determinants, 34-35
Cramer's rule, 35
Jacobians, 29, 73-74, 79-82
Diffusion index, 182, 188-190, 196
and statistical coverage, 191-192
Discontinuity, of hypotheses, 137-138
probability, 9-10
Disturbances (see Error term)
Dummy variables, 140, 156-157, 165
Efficiency, 24, 47, 51, 118
in instrumental variables, 118
in least squares, 47-49
and heteroskedasticity, 48
in limited information, 118
and maximum likelihood, 47-49
in Theil's method, 118
Eigenvector, 124
Elasticities, price, and Haavelmo's
proposition, 72
Equations, autoregressive, 52-53
simultaneous, 67-71, 73-84
notation, 74-76
structural, bogus, 89-90
Error term, 4, 8, 9-18
additivity assumption, 5, 18, 22
covariance matrix, 30, 49
Simplifying Assumptions, constancy
of variance (no. 3), 13, 17, 77
(See also Heteroskedasticity)
normally distributed (no. 4), 14, 17,
77
random real variable (no. 1), 9, 17,
77
serial independence of (no. 5), 16,
18, 77
uncorrelated, in multiequation
models (no. 7), 78-79, 87,
213
with predetermined variable (no.
6), 16, 18, 33, 52, 53, 65n.
zero expected value of (no. 2), 10,
17, 77
Errors, of econometric relationships,
64-76
of measurement, 6, 48
in variables, 155
Estimate, variance of, 39-43
Estimates in multiequation models, interdependence of, 82
Estimating criteria, 6, 8, 23, 47
(See also Consistency; Maximum
likelihood; Unbiasedness)
Estimation, 1-2, 8, 108-110
simultaneous, 67-68, 126-135
unbiased, in models of decay, 60-61
Estimators, extraneous, 134
subsample variances as, 207-208
Expectation, 18-20
Expected value, 10-11, 19-21
Factor analysis, linear orthogonal, 160-164
unspecified, 156-165
versus variance analysis, 164-165
Fisher, R. A., 51
Forecasts, criteria for, 132-134
leading indicators as, 186-188
Fox, K. A., 21, 134
Friedman, M., 72, 134
Frisch, R., 155
Gartaganis, A. J., 195
Gini, C., 51
Goldberger, A. S., x, 21
Goulden, C. H., 155
Haavelmo, T., 71, 106, 155, 196
Haavelmo proposition, 64-66, 71-72,
125
Heteroskedasticity, 48-49
efficiency, 48
(See also Error term, Simplifying
Assumptions, no. 3)
Hogben, L., 12n., 46, 51
Homoskcdastic equation, 50
(Sec also Heteroskedasticity)
Hood, W. C, xvii, 21, 85, 125
Hurwicz, L., 22, 53, 62
Hypotheses, choice of, 154155
discontinuous, 137-138
maintained, 136-137
null, 138
questioned, 137
testing of, 136-155
Identification, 3-4, 64, 85-106
counting rules, 92-96, 102
Identification, exact, 88, 90-91, 94, 107
absence of, 91, 128-131
of parameters in underidentified equation, 96-97, 128
(See also Overidentification; Underidentification)
Improvement, 44
Incomplete theory, 5
Independence of simultaneous equations, stochastic, 78-79, 86-87, 213
Initial conditions in models of decay, 53, 58-61
Instrumental variable technique, 107-117
properties, 114
related, to limited information, 118, 125
to reduced form, 113
Instrumental variables, efficiency in, 118
weighted, 116-117
Interdependence, simultaneous, 63-72
Jacobians, digression on, 79-82
of likelihood function, 29
references, 84
and simultaneous equations, 73-74
Jaffé, W., x
Jeffreys, H., 51
Kaplan, W., 84
Kendall, M. G., xvii, 21, 50, 51, 165, 173n., 195
Keynes, J. M., 51
Klein, L. R., xvii, 21, 51, 84, 106, 124n., 125, 134, 155, 201
Koizumi, S., 21
Koopmans, T. C., xvii, 22, 72, 84, 85, 94n., 106, 155, 195
Kuh, E., 134
Lacunes (missing data), 159n.
Lagged model, 53-54, 62
(See also Recursive models)
Lange, O., 22
Leading indicators, 186-188
Least squares, 23-50
bias, in models of decay, 52-62, 209-212
consistency, 44, 47-49, 195
diagonal, and simultaneous estimation, 67-68
directional, 69, 125
efficiency, 47-49
and heteroskedasticity, 48
generalized, 33-35
Haavelmo bias, 65-67
justification, 31-32, 134-135
maximum likelihood, 24, 31-34, 47-49, 57
naive, compared to maximum likelihood, 67, 82-83
reduced form, 88, 98, 113, 127, 133
references, 50-51, 134-135
related to instrumental variables, 113-114
relation to limited information, 125
simultaneous estimation, 67-70
stepwise, 203-206
sufficiency, 47-49
unbiasedness, 35-39, 47-49, 60, 65-67
used in estimating reduced form, 88
Leser, C. E. V., 157
Likelihood, 24-25, 50
Likelihood function, 23, 28-31
and identification, 90-91, 100-102
(See also Maximum likelihood)
Limited information, 4, 118-125, 128
consistency in, 118
efficiency in, 118
formulas for, 123
relation of, to indirect least squares, 125
to instrumental variables, 118, 125
Linear confluence, 142-144
Linear models, multiequation, notation, 74-77
versus ratio models, 152-153
Linearity, testing for, 150-152
Liu, T. C., 130-132
Marginal propensity to consume (see Consumption function)
Markoff theorem, 195
Marschak, J., 21
Matrix, 3, 27
coefficients, 75
inversion, 201-202
moments, 31, 124, 198-201
nonsingularity of, 86-87
orthogonality, 160-164
rank and identification, 93-94
triangular (recursive models), 83
Maximum likelihood, 23, 25, 29-33, 50, 78-84
computation, 83
consistency, 44-48
efficiency, 47-49
full information, 73-84, 118, 129, 133
identification, 100-103
interdependence, 82-83
simultaneous, 63-71
limited information, 118-125
subsample variances, 207
value for forecasting, 132-133
Mean, arithmetic, expected value, 10-11
Measurement, errors of, 6, 48
Meyer, J. R., 134
Miller, H. L., Jr., 134
Mitchell, W. C., 195
Model, 1
business cycle, 171-174, 183-184
lagged, 53-54, 62
latent, 110-112
manifest, 110-112
recursive, 83-84
supply-demand, 89
(See also Linear models)
Model specification (see Specification)
Moments, 3, 20, 32-35
algebra of, 51
computation of, 32-33, 197-200
determinants of, 34-35
expectation of, 20
matrices of, 34, 124
raw, 198
from sample means, 18-20
simple (augmented), 20, 198
Moore, G. H., 134, 187n., 196
Moving average technique, 61, 168
Slutsky proposition, 173-174
Multicollinearity, 3, 103-106
etymology, 105-106
(See also Confluence; Underidentification)
National Bureau of Economic Research, 181-182, 186-188, 195
Nature's Urn, 11-13, 168
subjective belief and, 11-12
Neyman, J., 195
Nonsingularity, 86-87
(See also Completeness)
Normal distribution, multivariate, 26-27
univariate, 14-15
Observations, 7
conflicting, 109
(See also Sample)
Orcutt, G. H., 72, 185, 195
Original form, 87-88
(See also Reduced form)
Orthogonality, 160-164
test of, 163-164
and variance analysis, 165
Oscillations, 53, 172-174
Overdeterminacy, 88-89
(See also Overidentification)
Overidentification, 86, 89, 98-103, 129
contrasted with underidentification, 100-103
Overidentified models, ambiguity in, 98
P lim (probability limit), 44
Parameter estimates, constraints on, 94-95
Parameter space and identification, 100-102
Parameters, 2, 8
a priori constraints on, 91-94
Pareto index, 194
Pearson, K., 51
Periodograms, 175-176, 195
Population, 7, 11-12, 141
Prediction, 2
(See also Forecasts; Leading indicators)
Probability, 3, 9-10, 24
density, 9
inverse, and maximum likelihood, 51
Probability limit (P lim), 44
Production function, 21-22
Cobb-Douglas, 157-159
Quandt, R. E., x
Random term (see Error term)
Random variable, 9
Rank, 93-94
Ratio models versus linear models, 152-153
Recursive models, 83-84, 121
Reduced form, 87-88
bias, 100-103
dual, Theil's method, 126-128
and instrumental variable technique, 113
least-squares forecasts, 133
Residual contrasted with error, 8
Sample, 6-7, 11-12, 141-142
and consistency, 43-46
size, 53
and unbiasedness, 35-43
variance, 39-43
Samples, conjugate, 52, 54-57
Sampling procedure, 6-21
Scaling of variables, 198
Schultz, H., 22, 90n.
Seasonal variation, 176-178
removal of, 182
Sectors, independence and nonsingularity, 87
split, versus sector variable, 140, 153-154
Serial correlation, 168-171, 175, 195
in Simplifying Assumptions, 16, 18, 77-78
Sets, conjugate, 55-57
of measure zero, 111
Shiskin, J., 182n.
Simon, H. A., 106
Simplifying Assumptions, 9-18
generalization to many-equation models, 77-78
mathematical statement, 17-18
sufficient for least squares, 32n.
(See also Error term)
Simultaneous estimation, 67-68, 126-135
Slutsky, E., 173-175, 195
Solow, R. M., 165
Specification, 1
and identification, 85
imperfect, 5
stochastic, 6
structural, 6
Statistical assumptions, 4
(See also Stochastic assumptions)
Stephan, F. F., x
Stochastic assumptions, 4, 6, 95
(See also Error term, Simplifying Assumptions)
Stochastic independence, 78-79, 87, 213
Subjective belief, 50
and Nature's Urn, 11-12
Subsample variances as unbiased estimators, 207-208
Sufficiency, 47, 51
Suits, D. B., 21
Supply-demand model, 89
Symbols, special, bogus relations, superscript ©, 89
estimates, maximum likelihood, as in â (hat), 8, 67
naive least squares, as in ǎ (bird), 8, 67
other kinds, unspecified, as in ã (wiggle), 8
matrices, boldface type, 33
variables, absent from given equation, superscript (**), 91
present in given equation, superscript (*), 91
Symmetrical distribution, 61
Testing of hypotheses, 136-155
Theil, H., 116, 126, 134
Theil's method, 126-128
consistency in, 118, 128
efficiency in, 118
stepwise least-squares computation, 203-206
Time series, 166-196
cyclical, 172
generated by structural models, 183-184
long-term, abuse of, 190
National Bureau analysis, 181-182, 186-188, 195
random series and Slutsky proposition, 173
Time-series estimates versus cross-section estimates, 192-195
Timing indicator, 186-188
Tinbergen, J., xvii, 21, 134, 155, 185, 195
Tintner, G., xvii, 21, 51, 62, 103
Trend removal, 178-181
Unbiasedness, 24, 35-37, 45-46, 51, 118
in instrumental variables, 114
in least squares, 35-39, 47-49, 60, 65-67
in limited information, 118
reasons for, 53
in reduced form, 100, 103
in subsample variance estimators, 207-208
in Theil's method, 128
(See also Bias)
Uncertainty Economics, 4-5
Underdeterminacy, 88-89
Underidentification, 86, 88-91, 96, 100-106, 128
of all structural relationships, 130
Universe, 11-12, 141-142
Unspecified factors, 156-165
(See also Sectors, split)
Urn of Nature (see Nature's Urn)
Valavanis, S., 22, 196
Variables, endogenous, 2, 64
errors in (see Errors in variables)
exogenous, 2, 64
Variables, independent (exogenous), 2, 64
scaling, 198
standardized, 145
Variance, 19, 40-41
covariance, 27-31, 82-83, 140-141, 170
of estimate, 39-43
estimates of, 41
infinite, 14
subsample as estimator, 207-208
Variance analysis and factor analysis, 164-165
Variate difference method, 21, 179-181
Verification, 2, 136-155
Watts, H. W., 165
Weights, arbitrary, 50
instrumental variables, 109-111, 116-117
and least squares, 109-110, 124
Wilson, E. B., 195
Wise, J., 195
Wold, H., 135
Working, E. J., 106
Yule, G. U., 51
Zero restriction, 92